Validate your AI model upgrade in one command

Status: draft for marketing review. Word count target 1500. Audience: AI and ML teams running MCP servers in production. Tone: editorial, concrete, no hype, no named-competitor comparisons.

The problem nobody puts in the release notes

Last Tuesday your model vendor pushed an update. Not a major version, not a deprecation, not anything the release-notes pipeline picked up. A silent quality refinement, the kind every vendor ships on a near-weekly cadence. Your test suite stayed green. Your latency dashboards stayed flat. Your support inbox lit up with three users reporting that a familiar workflow had stopped working.

You investigate. The MCP server you ship, the one a thousand customers have wired into their internal Claude and GPT and Gemini deployments, is returning the same JSON it always returned. The model is not. On a particular phrasing of a particular request, the new model decides not to call your lookup_account tool. It improvises a text answer. Your agent forwards the improvisation as if it were a confirmed lookup. The user acts on it. The user's bank account is debited the wrong amount.

This is the regression class that has no good answer in the current MCP testing toolbelt. Your function-level tests pass because your server's function-level behavior did not change. Your end-to-end smoke tests pass because the smoke prompts were not the ones that triggered the regression. The thing that changed is the model, and the model is the part of the system you do not own.

mcptest v1.1 ships the answer: mcptest model-compat, a diff-based gate that validates a candidate model against a recorded baseline before you roll it into production. One command, one report, one rollout decision.

The model is part of your application

The agent-platform community has been slow to acknowledge that the model is a dependency. Every other dependency in the stack has a version number, a release-notes feed, a CI gate that triggers when it bumps. The model has none of those. The version number on the API is a coarse label that hides a continuous stream of in-place quality updates. The release notes are blog posts written for ChatGPT users, not for your platform team. The CI gate does not exist, because the gate would have to compare behavior, not code, and there has been no good way to compare behavior.

If you ship an MCP server, you have already made a bet on a model. Your tools were designed with a specific model in mind. Your prompts were calibrated against that model's quirks. Your latency budgets, your token budgets, your retry logic, all of it was tuned to one model's distribution. When the vendor pushes a quality update, your bet either holds or it does not, and the only way to find out used to be to ship it and read the support tickets.

The argument we want to make in v1.1 is small but specific. Model behavior is the dependency you do not own, and you handle it the way you handle every other dependency you do not own: a CI gate that runs on every bump, against a representative baseline, with a clear pass-fail verdict.

What `mcptest model-compat` does

The workflow is three steps. Capture a baseline against the model you trust. Run a candidate against the new model. Diff the two and read the verdict.

mcptest model-compat capture \
    --model claude-opus-4-7 \
    --output baselines/claude-opus-4-7.json \
    tests/

mcptest model-compat verify \
    --baseline baselines/claude-opus-4-7.json \
    --candidate-model claude-opus-4-8 \
    --output compat-report.json \
    tests/

Behind those two commands is a diff engine that classifies every difference into one of three buckets:

PASS: the candidate's output is byte-identical to the baseline, or differs only in ways the engine treats as equivalent. Nothing needs review; the rollout proceeds.
DRIFT: the candidate's output is different in a way that preserves behavior. Text was rephrased. JSON keys came back in a different order. An optional response field appeared that was not there before. Drift entries land in the report so a reviewer can decide whether the new phrasing is acceptable; the gate does not block CI by default.
FAIL: the candidate's output is different in a way that breaks behavior. A required tool was not called. An argument value changed. A response field was dropped. The finish reason flipped from tool_use to stop. The gate exits non-zero. The rollout stops.
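
To make the three buckets concrete, here is a minimal Python sketch of how one baseline-versus-candidate pair could be classified. It is not the mcptest engine. The Turn record, its field names, and the account_id argument are hypothetical, chosen only to illustrate the rules listed above: tool calls, arguments, and the finish reason are checked first, and wording is checked last.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Turn:
    """One recorded model turn. Field names are illustrative, not mcptest's schema."""
    text: str = ""
    finish_reason: str = "stop"
    tool_calls: list = field(default_factory=list)  # [(tool_name, {arg: value}), ...]


def classify(baseline: Turn, candidate: Turn) -> tuple[str, str]:
    """Return (verdict, reason) for one baseline/candidate pair."""
    base_names = [name for name, _ in baseline.tool_calls]
    cand_names = [name for name, _ in candidate.tool_calls]

    # Behavior-breaking differences are checked first, so they win over drift.
    # Comparing the ordered name lists covers missing, extra, and swapped calls.
    if base_names != cand_names:
        return "FAIL", f"tool calls changed: {base_names} -> {cand_names}"
    for (name, base_args), (_, cand_args) in zip(baseline.tool_calls, candidate.tool_calls):
        # Dict equality ignores key order, so reordered arguments do not fail here.
        if base_args != cand_args:
            return "FAIL", f"arguments to {name} changed"
    if baseline.finish_reason != candidate.finish_reason:
        return "FAIL", f"finish reason changed: {baseline.finish_reason} -> {candidate.finish_reason}"

    # Same tools, same arguments, same finish reason: only the wording can differ.
    if baseline.text != candidate.text:
        return "DRIFT", "text differs; tool behavior preserved"
    return "PASS", "identical"


# The opening scenario: the baseline calls lookup_account, the candidate improvises.
baseline = Turn(finish_reason="tool_use", tool_calls=[("lookup_account", {"account_id": "A-1"})])
improvised = Turn(text="Your balance is fine.", finish_reason="stop")
print(classify(baseline, improvised))
```

Checking the breaking conditions before the cosmetic ones is the design choice that matters: a candidate that both rephrases its text and skips a tool call should be reported as a FAIL, not as a DRIFT.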

Every classification has a worked test in the fixture corpus the engine ships with: eighteen scenarios, from "the candidate and baseline are byte-identical" to "the candidate routed an email to the wrong recipient." To see what a DRIFT or FAIL verdict looks like in practice, read the fixtures at tests/fixtures/model-compat/; eight of them document FAIL cases.

The rollout gate, end to end

A complete model rollout under v1.1 looks like four steps.

Step one: capture a baseline. Run your existing mcptest suite against the model you trust today. The CLI walks every assertion, records every tool call and every tool argument and every response field, and writes a JSON snapshot. Commit the snapshot next to your tests. One baseline per model you support. A team that runs three models (Claude, GPT, Gemini) ships a baselines/ directory with three files.

Step two: stage the candidate. When the vendor pushes a new model, or when you decide to evaluate a model from a different vendor, wire it behind whatever feature flag your application already uses for rollouts. Production keeps the old model. The candidate is reachable from CI.

Step three: run the diff. A CI job runs your existing suite against the candidate, captures a candidate snapshot, and invokes mcptest model-compat verify. The report lands as a JSON artifact attached to the build. Drift entries are listed with their classification and rationale. Fail entries are listed with the specific invariant they violated.

Step four: decide. A green run, or a drift run that a reviewer signed off on, is the rollout gate. Promote the candidate to a small slice of production traffic, watch the dashboards for the first window, and ratchet up confidence at whatever pace your platform team trusts. The compatibility report is the artifact attached to the rollout ticket. The audit trail for "why did we promote this model" is complete: the baseline, the candidate, the diff, the approver.

If the diff comes back FAIL, the rollout does not happen. The candidate stays out of production until the model behavior is fixed, the model vendor responds to the ticket you file against them, or your test suite is updated to reflect a deliberate change in the expected behavior. None of those decisions gets made on a Slack thread five minutes before a release; each gets made against a written report that anyone can re-run.
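
The decision rule in step four is small enough to write down. The Python sketch below assumes a hypothetical report layout, a list of entries with "test" and "verdict" keys; the real compat-report.json schema may differ, so treat the key names and test names as placeholders. It encodes the policy described above: any FAIL blocks, unreviewed DRIFT waits for a reviewer, and everything else can be promoted.

```python
def rollout_decision(entries, approved=frozenset()):
    """Return 'blocked', 'needs-review', or 'promote' for a list of classified entries.

    entries:  [{"test": str, "verdict": "PASS" | "DRIFT" | "FAIL"}, ...]  (assumed layout)
    approved: names of tests whose DRIFT a reviewer has signed off on
    """
    # Any FAIL stops the rollout, regardless of what else is in the report.
    if any(entry["verdict"] == "FAIL" for entry in entries):
        return "blocked"

    # DRIFT does not block by itself, but it needs a sign-off before promotion.
    unreviewed = [
        entry["test"]
        for entry in entries
        if entry["verdict"] == "DRIFT" and entry["test"] not in approved
    ]
    if unreviewed:
        return "needs-review"
    return "promote"


report = [
    {"test": "lookup_account_happy_path", "verdict": "PASS"},
    {"test": "refund_summary_wording", "verdict": "DRIFT"},
]
print(rollout_decision(report))
print(rollout_decision(report, approved={"refund_summary_wording"}))
```

Keeping the approver's sign-off as an explicit input, rather than editing the report, leaves the report itself untouched as the audit artifact.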

Who this is for

If you ship an MCP server in production, the answer is "you." A more useful answer is the three audiences we built v1.1 against.

Platform teams running customer-facing assistants. A change in tool-call routing on a known phrasing is a customer-facing bug. The support volume from a silent model update is the kind of incident that gets a postmortem. The gate is the cheapest way to keep that postmortem off the runbook.

Regulated environments. Finance, healthcare, legal, government. The artifact your compliance team needs is the one this gate produces: a documented diff between the model you tested against and the model you are about to ship. A bonus for SOC 2 and HIPAA shops is that the artifact can be archived, signed, and cross-referenced from a change ticket.

Multi-model suites. Any team that supports a matrix of models has already discovered the hard way that one model's behavior is not another model's behavior. The matrix surface (Claude here, GPT there, Gemini for the customer that asked) only gets harder to maintain by hand. Running the diff against all three baselines on every model push keeps the matrix honest.

Who this is not for

A useful tool needs an honest "do not use it for X" section. Three non-uses:

One-off experiments. A throwaway prompt comparison does not need a baseline. Diff the two responses by hand and move on.
Low-stakes prototypes. A hackathon project does not need a rollout gate.
Pure server-logic unit tests. If the assertion does not depend on the model, model compatibility has nothing to add. Use the existing mcptest run flow.

The line between "use the gate" and "don't bother" is the cost of a silent regression. When that cost is small, the gate is overkill. When the cost is a customer-facing bug, a regulator's letter, or a payment routed to the wrong address, the gate is what catches the regression before it reaches production.

What's in the fixture corpus

The diff engine is tested against 18 scenarios, drawn from real regression patterns the team has seen in MCP integrations. The corpus covers every classification:

Three PASS scenarios: byte-identical content, multi-tool identity, and the "both responses are empty" edge case.
Seven DRIFT scenarios: rephrasing, multilingual rephrasing, key reorder, nested key reorder, additive shape changes, whitespace diffs, and case diffs.
Eight FAIL scenarios: argument value changes, numeric argument changes, a required tool not called, an extra tool called, response field removed, response field type changed, finish reason changed, and tool call order swapped.
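
Several of these scenarios concern the shape of a response rather than its wording. As a rough illustration, and not the engine's actual implementation, the Python sketch below walks two parsed JSON objects and reports key reorders and added fields as DRIFT, and removed fields and type changes as FAIL. The function name and the path notation are assumptions. Scalar value changes and list contents are deliberately out of scope here.

```python
def compare_shape(baseline, candidate, path="$"):
    """Yield (verdict, path, reason) for each structural difference between two JSON values."""
    if isinstance(baseline, dict) and isinstance(candidate, dict):
        for key in baseline:
            if key not in candidate:
                yield "FAIL", f"{path}.{key}", "field removed"
            else:
                # Recurse so nested reorders and nested removals are reported too.
                yield from compare_shape(baseline[key], candidate[key], f"{path}.{key}")
        for key in candidate:
            if key not in baseline:
                yield "DRIFT", f"{path}.{key}", "field added"
        # Same key set in a different order: cosmetic, since dicts preserve insertion order.
        if list(baseline) != list(candidate) and set(baseline) == set(candidate):
            yield "DRIFT", path, "keys reordered"
    elif type(baseline) is not type(candidate):
        yield "FAIL", path, f"type changed: {type(baseline).__name__} -> {type(candidate).__name__}"


before = {"id": 1, "owner": {"name": "a", "email": "b"}}
after = {"owner": {"email": "b", "name": "a"}, "id": "1", "region": "eu"}
for finding in compare_shape(before, after):
    print(finding)
```

Note that the strict type comparison treats an integer that comes back as a float as a type change; whether that should count as a FAIL is a policy question this sketch does not settle.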

Read the corpus README at tests/fixtures/model-compat/README.md for the full table. Treat the README as the spec: if you want to add a scenario to the engine, it goes there first.

How v1.1 fits next to v1.0

v1.0 was the CI gate for MCP servers. The YAML reference, the matchers, the compliance corpus, the reporters, the cassette layer. v1.1 sits on top of all of it. The diff engine consumes the same assertion shapes your v1.0 suite already declares; the captured baselines are the same JSON-RPC frames the v1.0 cassette layer normalizes. The new surface is three subcommands, one configuration block (model_compatibility:), and one fixture corpus.

v1.1 makes concrete the "v1.1 teaser" at the bottom of the v1.0 launch post. The user guide walks through the workflow with a worked GitHub-Issues MCP server example. The fixture corpus is in the repository today and is the source of truth for the diff-engine behavior that the W8 implementation delivers.

Three things to do next

If model regressions have bitten you before, three things move you forward today.

Capture a baseline. Even if the v1.1 binary is not on your $PATH yet, the v1.0 cassette layer already captures the JSON-RPC frames the diff engine consumes. Run your existing suite, save the cassettes, and when v1.1 lands you will have a baseline to diff against.

Star the repo. mcptest is Apache 2.0 licensed and lives at github.com/soapbucket/mcptest. The diff engine lands in the W8 milestone; follow the milestone on GitHub for status.

Tell us where your suite breaks. The fixture corpus represents the patterns we have seen. The patterns you have seen are the next set of fixtures to add. File a GitHub issue with the smallest possible baseline-vs-candidate pair that captures the regression, and we will fold it into the corpus.

The model is part of your application. The gate that catches its regressions is the work of v1.1.

Marketing notes: suggested launch-day distribution is the same channels as v1.0 (Hacker News Show HN, MCP community Discord, model-vendor partner channels, the enterprise customer list). Pair the post with a short video demo of model-compat verify against a fixture pair. Cross-post to the company blog with a longer worked example using the GitHub-Issues MCP server.