
Model compatibility: a rollout gate for MCP integrations

You ship an MCP server. Your customers connect Claude, GPT, Gemini, and a handful of in-house models to it. One Tuesday your model vendor pushes a silent quality update. A tool stops firing on a phrasing the old model understood. Latency drifts up because the new model retries more often. A JSON field that used to come back as a string is now a number. Your test suite is green. Your support inbox is not.

This is the gap mcptest model-compat closes. Capture a baseline run against a known-good model. Run a candidate against the new model. Run a diff that classifies every difference as PASS, DRIFT, or FAIL. Gate the rollout on the result. Three commands and one YAML file.

This guide explains the workflow, the classification model, and when the diff is the right tool for the job. The fixture corpus the diff engine is tested against lives at tests/fixtures/model-compat/ and is worth reading once to understand what each classification looks like in practice.

The compatibility runner is the v1.1 wedge feature. The CLI surface below is the shape the W8 implementation lands on; v1.0 already records the metadata under model_compatibility: and the diff engine is in flight. Follow the W8 milestone on GitHub for status.

The problem in one paragraph

The model is part of your application. When you upgrade a model the way you upgrade a library, you need the same kind of CI gate a library bump gets: "does the new version do the things the old one did?" For an MCP integration, "do the things" means calling the right tools with the right arguments in the right order, and returning a response shape clients can parse. None of that is visible to a function-level unit test. The model decides and the server reacts; the difference between the baseline run and the candidate run is the regression you care about.

The workflow

The full loop is four steps. Two of them live in CI; the other two are setup you do once per model upgrade.

1. Capture a baseline

Run your existing suite against the current production model. mcptest records every tool call, every tool argument, every response field, and every finish reason in a baseline.json snapshot.

mcptest model-compat capture \
    --model claude-opus-4-7 \
    --output baselines/claude-opus-4-7.json \
    tests/

Commit the baseline next to your tests. Keep one baseline per model you support. A multi-model suite ships with a small directory:

baselines/
  claude-opus-4-7.json
  claude-sonnet-4-5.json
  gpt-4o.json

2. Ship the new model behind a flag

Stage the model upgrade in your application code with whatever feature flag or environment variable you already use for rollouts. Production traffic still hits the old model. The candidate model is reachable only from CI.
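One way to stage the candidate, sketched in Python under the assumption that an environment variable carries the override. The variable name MCP_CANDIDATE_MODEL and the select_model helper are hypothetical and not part of mcptest; substitute whatever flag mechanism your application already has.

```python
import os

# The model production traffic uses today.
PRODUCTION_MODEL = "claude-opus-4-7"


def select_model(environ=os.environ):
    """Return the candidate model only when the override variable is set.

    MCP_CANDIDATE_MODEL is a hypothetical variable name used for
    illustration. CI sets it; production leaves it unset.
    """
    override = environ.get("MCP_CANDIDATE_MODEL", "").strip()
    return override or PRODUCTION_MODEL
```

With the variable unset or blank, the helper falls back to the production model, so a misconfigured deploy cannot route traffic to the candidate by accident.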

3. Run model-compat in CI

A CI job runs the same suite against the candidate model, captures a candidate.json, and diffs it against the baseline.

mcptest model-compat verify \
    --baseline baselines/claude-opus-4-7.json \
    --candidate-model claude-opus-4-8 \
    --output compat-report.json \
    tests/

The verifier prints one of three classifications per assertion (PASS, DRIFT, or FAIL) and rolls them up into a suite-level verdict.

The verifier exits non-zero on any FAIL. DRIFT is configurable: the default is exit zero with a warning, and --strict flips it to a non-zero exit so a reviewer must sign off on every drift before the rollout proceeds.
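The exit-code policy can be sketched as follows. This is an illustration, not the mcptest implementation. The roll-up rule (the worst classification wins) and the specific exit code 1 are assumptions; the text above only states that FAIL exits non-zero and that --strict makes DRIFT do the same.

```python
from enum import Enum


class Classification(Enum):
    PASS = "PASS"
    DRIFT = "DRIFT"
    FAIL = "FAIL"


def suite_verdict(results):
    """Roll per-assertion classifications up to the worst one seen.

    Assumption: "worst wins" is how the suite-level verdict is formed.
    """
    if Classification.FAIL in results:
        return Classification.FAIL
    if Classification.DRIFT in results:
        return Classification.DRIFT
    return Classification.PASS


def exit_code(results, strict=False):
    """Return the process exit code for a list of classifications."""
    verdict = suite_verdict(results)
    if verdict is Classification.FAIL:
        return 1  # any FAIL blocks the rollout
    if verdict is Classification.DRIFT and strict:
        return 1  # strict mode: drift needs a reviewer sign-off
    return 0
```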

4. Gate the rollout

A green model-compat verify job is the rollout gate. Promote the candidate to a fraction of traffic, watch the metrics dashboard for the first window, and ratchet up confidence on a schedule you control. The diff report attaches to the rollout ticket so the audit trail is complete: "we ran the suite against the new model, here is the diff, here is who approved the drift entries."

When the candidate becomes the new baseline (after the rollout settles), re-run the capture step against it. Commit the new baseline. The cycle repeats on the next model push.

When to use model compatibility

Use it when the cost of a silent regression is high:

When NOT to use it

The diff engine is not free, and it is not the right tool for every scenario:

Classifications in detail

Every difference the diff engine produces is one of three classifications. The fixture corpus encodes 18 representative scenarios; the table below is a curated subset. Read tests/fixtures/model-compat/README.md for the full list.

| Classification | Example scenario | Why |
|---|---|---|
| PASS | Identical text and tool calls (01-identical). | Byte-identical output. |
| PASS | Empty content and tool lists on both sides (18-empty-vs-empty). | Both responses are "no action required." |
| DRIFT | Text rephrased (03-rephrasing-only). | New phrasing, same meaning. |
| DRIFT | JSON object key order reshuffled (05-tool-args-reordered). | Structurally equivalent. |
| DRIFT | A new optional response field appeared (11-response-shape-added-field). | Additive. |
| DRIFT | Whitespace or casing change (16-whitespace-only-diff, 17-case-only-diff). | Surface-level only. |
| FAIL | Tool argument value changed (07-tool-args-value-changed). | "Send email to bob" instead of "send email to alice" routes the message to the wrong person. |
| FAIL | Required tool not called (09-required-tool-not-called). | The baseline called lookup_account; the candidate refused to. |
| FAIL | Tool call order swapped (15-tool-call-order-swapped). | Order is part of the contract. |
| FAIL | Finish reason changed (14-finish-reason-changed). | stop vs max_tokens is the difference between "model finished" and "model ran out of room." |
| FAIL | Response field removed or type-changed (12, 13). | Clients downstream of the response will crash. |

The diff engine's job is to land every observed difference in exactly one classification. The fixture corpus is the test list for that engine. Adding a scenario is documented in tests/fixtures/model-compat/README.md.
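To make the table concrete, here is a minimal Python sketch of a classifier over records shaped like the JSON in the worked example below (content, tool_calls, finish_reason, model). It is not the mcptest diff engine, and several choices are assumptions: the response-shape checks look only at top-level keys, extra or reordered tool calls are both reported as a changed sequence, and every text change is reported as DRIFT because a sketch cannot judge whether the meaning held. The function names are hypothetical.

```python
import json

IGNORED_KEYS = {"model"}  # the model name is expected to differ


def text_of(response):
    """Concatenate the text parts of a response record."""
    return "".join(
        part.get("text", "")
        for part in response.get("content", [])
        if part.get("type") == "text"
    )


def normalize(text):
    """Collapse whitespace and casing so surface-only edits compare equal."""
    return " ".join(text.split()).lower()


def classify(baseline, candidate):
    """Return (classification, reason) for one baseline/candidate pair."""
    drift = []

    # Response shape: removed or retyped fields are FAIL, added ones DRIFT.
    for key in baseline.keys() - IGNORED_KEYS:
        if key not in candidate:
            return "FAIL", f"response field removed: {key}"
        if type(baseline[key]) is not type(candidate[key]):
            return "FAIL", f"response field type changed: {key}"
    if candidate.keys() - baseline.keys() - IGNORED_KEYS:
        drift.append("response field added")

    if baseline.get("finish_reason") != candidate.get("finish_reason"):
        return "FAIL", "finish reason changed"

    base_calls = baseline.get("tool_calls", [])
    cand_calls = candidate.get("tool_calls", [])
    base_names = [call["name"] for call in base_calls]
    cand_names = [call["name"] for call in cand_calls]
    for name in base_names:
        if name not in cand_names:
            return "FAIL", f"required tool not called: {name}"
    if base_names != cand_names:
        return "FAIL", "tool call sequence changed"

    for base, cand in zip(base_calls, cand_calls):
        if base["arguments"] != cand["arguments"]:
            return "FAIL", f"tool argument value changed: {base['name']}"
        # Equal as dicts but serialized differently: only key order moved.
        if json.dumps(base["arguments"]) != json.dumps(cand["arguments"]):
            drift.append("tool argument key order changed")

    base_text, cand_text = text_of(baseline), text_of(candidate)
    if base_text != cand_text:
        if normalize(base_text) == normalize(cand_text):
            drift.append("whitespace or casing changed")
        else:
            drift.append("text rephrased")

    if drift:
        return "DRIFT", "; ".join(drift)
    return "PASS", "no differences"
```

Hard failures return immediately, while drift reasons accumulate, so a pair with both a rephrasing and a missing tool call is reported as FAIL rather than DRIFT.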

Worked example: a GitHub-Issues MCP server

The setup. Your team runs a small MCP server, github-issues, that exposes create_issue, list_issues, and lookup_account. Production agents use Claude Opus 4.7. You want to roll out a candidate model upgrade.

Step 1: capture the baseline

A small test suite runs the canonical flow: "find the customer's account, then file a triage issue against the right repository."

mcptest model-compat capture \
    --model claude-opus-4-7 \
    --output baselines/claude-opus-4-7.json \
    tests/triage-flow.yml

The recorded baseline.json captures, for one specific assertion, a shape like the fixture tests/fixtures/model-compat/02-identical-multi-tool/:

{
  "content": [
    { "type": "text", "text": "Looking up the account and filing the issue." }
  ],
  "tool_calls": [
    { "name": "lookup_account", "arguments": { "id": "acct-7" } },
    { "name": "create_issue",   "arguments": { "repo": "search-svc", "title": "Triage queue overflow", "labels": ["triage"] } }
  ],
  "finish_reason": "tool_use",
  "model": "claude-opus-4-7"
}

Step 2: run the candidate

In CI, a separate job runs the same suite against the candidate model (say, claude-opus-4-8). The runner records a candidate.json with the same shape, then invokes the diff engine.

mcptest model-compat verify \
    --baseline baselines/claude-opus-4-7.json \
    --candidate-model claude-opus-4-8 \
    --output compat-report.json \
    tests/triage-flow.yml

Step 3: read the diff

Three kinds of outcomes are possible. Each maps to a fixture in the corpus.

Outcome A: clean pass. The candidate produces the same two tool calls with the same arguments. The expected.yaml shape mirrors tests/fixtures/model-compat/02-identical-multi-tool/expected.yaml:

classification: PASS
rationale: |
  Text and both tool calls match byte for byte. Tool order matches.
  Model name differs as expected.
invariants_violated: []

Roll forward.

Outcome B: acceptable drift. The candidate rephrases the leading text ("Let me find the account and open a triage ticket.") but the tool calls match. The diff engine classifies this as DRIFT with drift_kind: semantic-equivalent. The reviewer sees the rephrasing, signs off, and the rollout proceeds. Without --strict, drift does not block CI.

Outcome C: a real regression. The candidate refuses to call lookup_account and returns a text-only response. The diff engine classifies this exactly as fixture tests/fixtures/model-compat/09-required-tool-not-called/expected.yaml:

classification: FAIL
rationale: |
  Baseline called `lookup_account`. Candidate refused to call it and
  returned a text-only response. Whether a tool gets called for a given
  user intent is the core behavioral invariant of an MCP server test;
  silently skipping the call is a hard failure.
invariants_violated:
  - tool_called.lookup_account

CI exits non-zero. The rollout stops. The reviewer files a ticket against the model vendor with the diff attached. The candidate stays out of production until either the model behavior is fixed or the test suite's expectation is renegotiated.

Step 4: the rollout

A clean run, or a drift run that a reviewer signed off on, becomes the gate that lets the candidate model into production traffic. The diff report (compat-report.json) is the artifact attached to the rollout ticket. The audit trail for "why did we promote this model" is complete: the baseline, the candidate, the diff, the approver.

Configuring the run

The runner reads three sources for what to test:

The expected.yaml files in the fixture corpus enumerate the canonical invariant names. Your suite does not have to list them by hand; the runner derives them from the assertions you already wrote.

Invariants

Invariants are the model-agnostic guardrails for a workflow. A surface diff alone tells you "the candidate phrased the response differently". An invariant tells you "the candidate failed to call the tool that has to fire for this workflow to work at all". The diff engine surfaces invariant failures as FAIL regardless of how the rest of the diff classified the pair.

Use invariants when the cost of a silent regression is high enough that a rephrased response is not the worst case you want to catch. A customer-support agent that stops calling lookup_account is broken whether or not the prose around the missing call is fluent.

Authoring

Declare invariants on a baseline or per-test. Each invariant has a name (used in the report), a kind (what to evaluate), and a condition (the payload):

invariants:
  - name: must-look-up-account
    kind: tool_called
    condition:
      tool: lookup_account
  - name: lookup-needs-id
    kind: arg_present
    condition:
      tool: lookup_account
      arg: id
  - name: mentions-account
    kind: response_field_present
    condition:
      text: "account"
  - name: latency-under-budget
    kind: latency_under_ms
    condition:
      ms: 800

Five kinds are supported in v1:

How they integrate with the diff

The diff engine evaluates every declared invariant against the candidate run and emits one entry per violation in the ChangeCategory::Invariant bucket. Every violation classifies as FAIL, so a passing surface diff with a single failed invariant still exits non-zero in CI. The full evaluation result (one row per invariant, pass or fail) lives in BaselineDiff::invariant_results so reporters can show a green check next to invariants that held.
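A minimal Python sketch of that evaluation, covering three of the kinds from the authoring example. It is illustrative only: the real engine exposes ChangeCategory::Invariant and BaselineDiff::invariant_results, whereas the function names here, the latency_ms field on the run record, and the rule that arg_present fails when the tool was never called are all assumptions. The response_field_present kind is left out because its condition semantics are not spelled out here.

```python
def evaluate_invariant(invariant, run):
    """Return True when the candidate run satisfies one declared invariant."""
    kind = invariant["kind"]
    cond = invariant["condition"]
    calls = run.get("tool_calls", [])

    if kind == "tool_called":
        return any(call["name"] == cond["tool"] for call in calls)
    if kind == "arg_present":
        matching = [call for call in calls if call["name"] == cond["tool"]]
        # Assumption: no call to the tool at all counts as a violation.
        return bool(matching) and all(
            cond["arg"] in call["arguments"] for call in matching
        )
    if kind == "latency_under_ms":
        # Assumption: the run record carries a hypothetical latency_ms field.
        return run["latency_ms"] < cond["ms"]
    raise ValueError(f"unsupported invariant kind in this sketch: {kind}")


def invariant_results(invariants, run):
    """One (name, held) row per invariant, so reporters can show passes too."""
    return [(inv["name"], evaluate_invariant(inv, run)) for inv in invariants]


def has_violation(results):
    """True when at least one invariant failed, which maps to FAIL."""
    return any(not held for _, held in results)
```

Keeping the full result list, rather than only the violations, is what lets a reporter print a green check next to each invariant that held.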

When NOT to use invariants

CI snippet

The GitHub Actions steps for capturing the baseline, verifying the candidate, and uploading the report:

- name: Capture baseline (manually, once per release cycle)
  if: github.event_name == 'workflow_dispatch'
  run: |
    mcptest model-compat capture \
      --model claude-opus-4-7 \
      --output baselines/claude-opus-4-7.json \
      tests/

- name: Verify candidate against baseline
  run: |
    mcptest model-compat verify \
      --baseline baselines/claude-opus-4-7.json \
      --candidate-model claude-opus-4-8 \
      --output compat-report.json \
      tests/

- name: Upload report
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: compat-report
    path: compat-report.json
