Spec: Schema evolution diff (mcptest diff)
- Status: Draft, deferred
Run this example. mcptest diff ships today. examples/diff-tools-baseline.json and examples/diff-tools-current.json differ by a removed tool and a newly required argument, so the diff reports two breaking changes and exits non-zero.
mcptest diff examples/diff-tools-baseline.json examples/diff-tools-current.json
Purpose
mcptest diff compares two snapshots of an MCP server's public surface and flags changes that are likely to break callers. The snapshot is the union of tools/list and resources/list (and any extensions to the discovery surface that ship later). The diff classifies each change by severity and emits structured output that a CI pipeline can act on.
This spec covers the command shape, the severity classification table, the output formats, the exit-code policy, and the worked example output. It does not cover protocol-level diffs (request shape, header behavior, error taxonomy) because those belong to the regular test path, not the catalog path.
Use cases
- Pre-merge CI guardrail. A service owner runs mcptest diff against the previous release's catalog baseline as part of CI. The build fails when a tool argument flips from optional to required, an enum value disappears, or a tool is removed without a deprecation window.
- Pre-deploy gate. A deploy pipeline runs mcptest diff between the staging cassette and the production cassette before promoting a build. Breaking changes block the promotion.
- Audit and changelog generation. A release engineer runs mcptest diff to produce the catalog-change section of the release notes. Output renders to Markdown and drops into the changelog.
- Consumer impact assessment. A consumer of a third-party MCP server records a cassette today, records another after the next release, and uses the diff to scope client work.
Command shapes
Three input combinations are supported. All three return the same diff shape; only the inputs differ.
Cassette to cassette
mcptest diff <baseline.cassette> <current.cassette>
Both arguments are recorded cassette files. The diff is computed from the recorded tools/list and resources/list entries in each cassette. If a cassette does not contain one or both responses, the command exits with a clear error pointing at the refresh path (see "Cassette refresh path" below).
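As a rough illustration of the first step of such a comparison, the sketch below computes which tool names were added and which were removed between two snapshots. The function name and the use of plain name lists are assumptions made for this example; the real catalog model is not defined in this spec.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: returns (added, removed) tool names, each sorted.
/// `baseline` and `current` stand in for the names found in the recorded
/// tools/list response of each cassette.
fn added_and_removed(baseline: &[&str], current: &[&str]) -> (Vec<String>, Vec<String>) {
    let before: BTreeSet<&str> = baseline.iter().copied().collect();
    let after: BTreeSet<&str> = current.iter().copied().collect();

    // Present only in the current snapshot: added.
    let added = after.difference(&before).map(|s| s.to_string()).collect();
    // Present only in the baseline snapshot: removed.
    let removed = before.difference(&after).map(|s| s.to_string()).collect();

    (added, removed)
}
```

Tools present in both snapshots would then be compared argument by argument; that step is not shown here.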
Cassette to live server
mcptest diff <baseline.cassette> --against <url|command>
<url|command> is either an HTTP URL the runner can reach (URL transport) or a subprocess specification matching the servers[].command shape in the YAML schema. The runner connects to the live server, issues initialize, tools/list, and resources/list, and diffs the responses against the baseline cassette. Authentication, if needed, comes from the standard mcptest auth surface (env, dotenv, CLI flag).
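One way the url-or-command distinction could be drawn is sketched below. The `Target` type, the `parse_target` function, the prefix check, and the whitespace splitting are all assumptions for illustration; the spec only says the command must match the servers[].command shape, which is not reproduced here.

```rust
/// Hypothetical parsed form of the `--against <url|command>` argument.
#[derive(Debug, PartialEq)]
enum Target {
    /// An HTTP(S) endpoint the runner connects to directly.
    Url(String),
    /// A subprocess to spawn: program followed by its arguments.
    Command { program: String, args: Vec<String> },
}

/// Assumption: anything starting with `http://` or `https://` is a URL;
/// everything else is a command line split on whitespace (no quoting
/// support in this sketch).
fn parse_target(raw: &str) -> Result<Target, String> {
    let trimmed = raw.trim();
    if trimmed.starts_with("http://") || trimmed.starts_with("https://") {
        return Ok(Target::Url(trimmed.to_string()));
    }
    let mut parts = trimmed.split_whitespace().map(str::to_string);
    match parts.next() {
        Some(program) => Ok(Target::Command { program, args: parts.collect() }),
        None => Err("empty --against target".to_string()),
    }
}
```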
Two baselines (advanced)
mcptest diff --baseline mcptest.baseline-catalog.yml --against <target>
mcptest.baseline-catalog.yml is a hand-authored or generated catalog file checked into the repo (see "Baseline catalog file" below). The --against target is a cassette path, a URL, or a subprocess. The two-baseline shape is the canonical "CI guardrail" usage: the baseline file moves only via deliberate review, and any drift between the baseline and the live server fails the build.
Severity classification
Every diff entry carries one of three severities. The classification table is the load-bearing artifact of this spec; reviewers should read it carefully because the severity decides the exit code.
| Change | Severity | Notes |
|---|---|---|
| Tool removed | BREAKING | Existing callers fail immediately. |
| Tool renamed | BREAKING | Modeled as removal plus addition unless the rename is annotated (see "Open questions"). |
| Argument optional becomes required | BREAKING | Callers that previously omitted the argument now fail validation. |
| Argument default value changed | BREAKING | Behavior shift for callers relying on the old default, even though calls still validate. |
| Argument removed entirely | BREAKING | Callers that previously sent the argument now fail validation. |
| Argument type narrowed (e.g. `string \| null` to `string`) | BREAKING | Callers sending the variant that was narrowed out fail. |
| Enum value removed | BREAKING | Callers sending the removed value fail validation. |
| URI template shape changed | BREAKING | Resource subscribers expecting the old shape break. |
| Output schema tightened (e.g. field removed, type narrowed) | BREAKING | Consumers parsing the removed field fail. |
| Output schema gains a required field | POTENTIALLY BREAKING | Consumers that did not previously decode the field may still tolerate it; consumers that strict-parse fail. |
| Tool added | NON-BREAKING | New capability. |
| Argument added as optional with default | NON-BREAKING | Existing callers still validate. |
| Argument type widened (e.g. `string` to `string \| null`) | NON-BREAKING | Callers sending the old type still validate. |
| Enum value added | NON-BREAKING | Callers sending the old set still validate. |
| Output schema relaxed (field made optional) | NON-BREAKING | Consumers parsing the field still see it when present. |
| Description text changed | NON-BREAKING | Caller behavior is unaffected. Description quality is a separate concern (see docs/description-quality.md). |
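A few of the argument rows can be expressed as a small classifier. This is a minimal sketch, not the diff engine: `ArgSpec` and `classify_arg` are hypothetical names, only the required flag and the default value are modeled, and treating a newly added required argument as BREAKING is an assumption (the table only lists the optional case).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    NonBreaking,
    Breaking,
}

/// Hypothetical, simplified view of one tool argument.
#[derive(Debug, Clone, PartialEq)]
struct ArgSpec {
    required: bool,
    /// Default value rendered as text; `None` means no default.
    default: Option<String>,
}

/// Classify the change between two versions of the same argument.
/// `None` on either input means the argument is absent in that snapshot.
/// Returns `None` when nothing covered by this sketch changed.
fn classify_arg(before: Option<&ArgSpec>, after: Option<&ArgSpec>) -> Option<Severity> {
    match (before, after) {
        (None, None) => None,
        // Argument removed entirely.
        (Some(_), None) => Some(Severity::Breaking),
        // Argument added. Assumption: required additions are breaking.
        (None, Some(a)) => Some(if a.required {
            Severity::Breaking
        } else {
            Severity::NonBreaking
        }),
        (Some(b), Some(a)) => {
            let became_required = !b.required && a.required;
            let default_changed = b.default != a.default;
            if became_required || default_changed {
                Some(Severity::Breaking)
            } else {
                None
            }
        }
    }
}
```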
The severity of a diff as a whole is the maximum severity across all the changes it contains, with BREAKING ranking above POTENTIALLY BREAKING and POTENTIALLY BREAKING above NON-BREAKING. A diff that adds an optional argument to one tool (NON-BREAKING) and removes another tool (BREAKING) is BREAKING overall.
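The maximum rule maps naturally onto an ordered enum. The sketch below assumes the ordering NON-BREAKING < POTENTIALLY BREAKING < BREAKING; the type and function names are illustrative and not taken from the codebase.

```rust
/// Severities declared from least to most severe. Deriving `Ord` on an enum
/// orders variants by declaration order, so `Breaking` compares greatest.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Severity {
    NonBreaking,
    PotentiallyBreaking,
    Breaking,
}

/// Overall severity of a diff: the maximum across its changes.
/// Returns `None` for an empty diff (no changes at all).
fn overall_severity(changes: &[Severity]) -> Option<Severity> {
    changes.iter().copied().max()
}
```

Returning `None` for an empty diff keeps "no changes" distinct from "only NON-BREAKING changes", even though both map to exit code 0.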
Output formats
mcptest diff reuses the existing reporter set. The reporter is selected with --format. Supported values:
- pretty (default): human-readable text, color in a TTY, ASCII fallback out of a TTY. Used for local runs and developer terminals.
- json: structured JSON for downstream tooling. The top-level shape is {summary, changes[]}, where each change is {kind, severity, path, before, after, description}.
- junit: each tool change becomes a test case, each BREAKING change is a failed test, each NON-BREAKING change is a passed test, each POTENTIALLY BREAKING change is a skipped test with a message. The shape fits the existing JUnit reporter.
- markdown: a PR-comment-friendly summary. Tables for added, removed, and changed tools. Severity is rendered as a leading badge per row. Designed to be pasted into a GitHub or GitLab PR comment by a CI action.
- sarif: each change becomes a SARIF result entry, severity maps to SARIF level (error for BREAKING, warning for POTENTIALLY BREAKING, note for NON-BREAKING). Designed for code-scanning surfaces that ingest SARIF (GitHub Advanced Security, Sonar).
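The severity mappings described for the junit and sarif reporters could look like the following. The enum and function names are hypothetical; only the mappings themselves come from the list above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    NonBreaking,
    PotentiallyBreaking,
    Breaking,
}

/// Outcome of the JUnit test case generated for one change.
#[derive(Debug, PartialEq, Eq)]
enum JunitOutcome {
    Passed,
    Failed,
    Skipped,
}

/// SARIF `level` string for a change.
fn sarif_level(severity: Severity) -> &'static str {
    match severity {
        Severity::Breaking => "error",
        Severity::PotentiallyBreaking => "warning",
        Severity::NonBreaking => "note",
    }
}

/// JUnit outcome for a change: failed, skipped (with a message), or passed.
fn junit_outcome(severity: Severity) -> JunitOutcome {
    match severity {
        Severity::Breaking => JunitOutcome::Failed,
        Severity::PotentiallyBreaking => JunitOutcome::Skipped,
        Severity::NonBreaking => JunitOutcome::Passed,
    }
}
```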
All five reporters share the same internal diff model. Adding a sixth reporter does not require rewriting the diff engine.
Exit codes
The exit code is the load-bearing CI signal. The policy is deliberately distinct from the test runner's exit codes so that mcptest run and mcptest diff can be invoked from the same script without ambiguity.
| Code | Meaning |
|---|---|
| 0 | No changes, or only NON-BREAKING changes detected. |
| 6 | BREAKING changes detected. |
| 7 | POTENTIALLY BREAKING changes detected, no BREAKING changes. (Optional, off by default; enabled with --strict-potentially-breaking.) |
| 64 | CLI usage error (missing file, bad flag). Inherits the EX_USAGE convention. |
| 70 | Internal error (panic, IO failure, cassette parse failure). Inherits the EX_SOFTWARE convention. |
Exit code 1 is reserved for mcptest run test failures; diff never uses it. Code 6 was chosen because it does not collide with any reserved Unix sysexits code (64 through 78) and because it is mnemonic ("breaking changes").
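For the success and breaking-change rows, the policy reduces to a short decision function. This is a sketch under the assumption that the per-severity counts are already available; `Summary` and `exit_code` are illustrative names, and codes 64 and 70 are raised on other paths and not modeled here.

```rust
/// Per-severity counts, mirroring the `summary` object of the JSON reporter.
struct Summary {
    breaking: usize,
    potentially_breaking: usize,
    non_breaking: usize,
}

/// Exit code for a completed diff.
/// `strict` corresponds to the --strict-potentially-breaking flag.
fn exit_code(summary: &Summary, strict: bool) -> i32 {
    if summary.breaking > 0 {
        6
    } else if strict && summary.potentially_breaking > 0 {
        7
    } else {
        // No changes, only NON-BREAKING changes, or POTENTIALLY BREAKING
        // changes without the strict flag.
        0
    }
}
```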
Cassette refresh path
Older cassettes recorded before the catalog-aware recorder shipped do not necessarily include tools/list and resources/list. When mcptest diff loads such a cassette, it exits with code 64 and prints:
error: cassette `<path>` does not include a tools/list response.
hint: refresh the cassette catalog with:
mcptest record --catalog-only --cassette <path>
mcptest record --catalog-only re-runs initialize, tools/list, and resources/list against the original server and patches the cassette in place. The flag is additive: existing recorded interactions inside the cassette are preserved. The recorder fails with a clear error if the original server cannot be reached.
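The load-time check could be sketched as below. `RecordedCatalog` and `check_catalog` are hypothetical names, and the message for a missing resources/list response is an assumption modeled on the tools/list message shown above.

```rust
/// Hypothetical summary of which discovery responses a cassette recorded.
struct RecordedCatalog {
    has_tools_list: bool,
    has_resources_list: bool,
}

/// Exit code the spec assigns to this failure.
const EXIT_USAGE: i32 = 64;

/// Returns the exit code and error text when a discovery response is missing.
fn check_catalog(path: &str, recorded: &RecordedCatalog) -> Result<(), (i32, String)> {
    let missing = if !recorded.has_tools_list {
        "tools/list"
    } else if !recorded.has_resources_list {
        "resources/list"
    } else {
        return Ok(());
    };
    let message = format!(
        "error: cassette `{}` does not include a {} response.\nhint: refresh the cassette catalog with:\nmcptest record --catalog-only --cassette {}",
        path, missing, path
    );
    Err((EXIT_USAGE, message))
}
```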
Baseline catalog file
The two-baseline shape uses a YAML file (mcptest.baseline-catalog.yml) that names a set of tools and their argument shapes. The file is a human-readable subset of the cassette format: only the catalog matters, no recorded interactions. The format is the subject of a separate follow-up spec; the diff command treats the baseline file as opaque and delegates parsing to the catalog crate.
The reason to support a hand-authored baseline rather than insisting on a cassette: cassettes record real responses, which may include environment-specific data (account IDs, hostnames) that the catalog should not carry. A baseline file is the "intended public surface" in source control, edited deliberately, reviewed in PRs.
Worked example output
Pretty reporter, against a fixture pair with several intentional changes:
$ mcptest diff cassettes/v1.cassette cassettes/v2.cassette
Tool catalog diff: cassettes/v1.cassette -> cassettes/v2.cassette
Tools added (1):
  + delete_issue
      args: id (string, required)
      description: "Delete a Linear issue by ID."
Tools removed (1):
  - archive_issue (BREAKING)
      last seen with: args.id (string, required)
Tools changed (2):
  create_issue
    args.priority: optional -> required (BREAKING)
    args.labels: default value `[]` removed (BREAKING)
    args.estimate: added, optional, type number (NON-BREAKING)
  list_issues
    result.next_cursor: type widened string -> string|null (BREAKING)
Resources unchanged.
Summary: 4 BREAKING, 2 NON-BREAKING, 0 POTENTIALLY BREAKING.
Exit code: 6
The JSON reporter renders the same diff as follows (changes[] is abridged here to the first three of the six entries):
{
  "summary": {
    "breaking": 4,
    "potentially_breaking": 0,
    "non_breaking": 2
  },
  "changes": [
    {
      "kind": "tool_added",
      "severity": "non_breaking",
      "path": "tools.delete_issue",
      "before": null,
      "after": { "name": "delete_issue", "args": [{ "name": "id", "type": "string", "required": true }] }
    },
    {
      "kind": "tool_removed",
      "severity": "breaking",
      "path": "tools.archive_issue",
      "before": { "name": "archive_issue" },
      "after": null
    },
    {
      "kind": "arg_optional_to_required",
      "severity": "breaking",
      "path": "tools.create_issue.args.priority",
      "before": { "required": false },
      "after": { "required": true }
    }
  ]
}
JUnit, Markdown, and SARIF outputs follow the same internal model.
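A shared internal model of this kind might be shaped roughly as follows. The struct and function names are assumptions, and the `before`, `after`, and `description` fields are left out to keep the sketch dependency-free. The sample data covers only the three changes listed in the JSON above, so its tally is smaller than the summary of the full worked example.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    NonBreaking,
    PotentiallyBreaking,
    Breaking,
}

/// One entry of `changes[]` (hypothetical, reduced to three fields).
#[derive(Debug, Clone)]
struct Change {
    kind: &'static str,
    severity: Severity,
    path: &'static str,
}

/// Per-severity counts, matching the keys of the JSON `summary` object.
#[derive(Debug, Default, PartialEq, Eq)]
struct Summary {
    breaking: usize,
    potentially_breaking: usize,
    non_breaking: usize,
}

/// Tally changes by severity.
fn summarize(changes: &[Change]) -> Summary {
    let mut summary = Summary::default();
    for change in changes {
        match change.severity {
            Severity::Breaking => summary.breaking += 1,
            Severity::PotentiallyBreaking => summary.potentially_breaking += 1,
            Severity::NonBreaking => summary.non_breaking += 1,
        }
    }
    summary
}

/// The three changes listed in the JSON example.
fn listed_changes() -> Vec<Change> {
    vec![
        Change { kind: "tool_added", severity: Severity::NonBreaking, path: "tools.delete_issue" },
        Change { kind: "tool_removed", severity: Severity::Breaking, path: "tools.archive_issue" },
        Change {
            kind: "arg_optional_to_required",
            severity: Severity::Breaking,
            path: "tools.create_issue.args.priority",
        },
    ]
}
```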
Open questions
Three questions are deferred to the implementation ticket. Each is called out here so reviewers know they are unresolved.
- Resource URI templates with parameter renames. A URI template like issues/{issue_id} becoming issues/{id} changes the parameter name but not the structural shape. Is that a BREAKING change (clients parsing the parameter name fail), a POTENTIALLY BREAKING change (clients matching on shape are fine), or NON-BREAKING (most clients pass through opaque)? The current proposal is POTENTIALLY BREAKING, with a flag to upgrade to BREAKING for strict shops.
- "Renamed" versus "removed plus added". A tool that gets renamed from create_issue to create_ticket looks identical to a removal plus an addition. The cassette format has no rename annotation today. Options: (a) treat all renames as removal plus addition and document that authors should add a deprecation window; (b) add a rename annotation to the catalog format; (c) heuristic match on argument shape similarity. The proposal is (a) for the first implementation, with (b) as a future follow-up if user demand surfaces.
- Description-only changes inside otherwise unchanged tools. A tool whose description text changes but whose argument shape does not is NON-BREAKING by the table above. Some shops want to see those changes in the diff anyway (for changelog generation). The proposal is to emit them in the diff with severity: non_breaking but to suppress them from the summary unless --show-descriptions is passed. Verdict pending implementation.
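For the first question, a parameter rename can be told apart from a structural change by comparing templates with the parameter names erased. The sketch below is one possible approach, not a decided design: the names are hypothetical, and it handles only simple {name} placeholders, not template operators such as {+path}.

```rust
/// Replace every `{name}` placeholder with `{}` so that only the structural
/// shape of the template remains.
fn template_shape(template: &str) -> String {
    let mut out = String::new();
    let mut in_param = false;
    for ch in template.chars() {
        match ch {
            '{' => {
                in_param = true;
                out.push_str("{}");
            }
            '}' => in_param = false,
            // Skip the characters of the parameter name itself.
            _ if in_param => {}
            _ => out.push(ch),
        }
    }
    out
}

#[derive(Debug, PartialEq, Eq)]
enum TemplateChange {
    Unchanged,
    /// Same shape, different parameter names.
    ParamRenameOnly,
    /// The structural shape differs.
    ShapeChanged,
}

fn compare_templates(before: &str, after: &str) -> TemplateChange {
    if before == after {
        TemplateChange::Unchanged
    } else if template_shape(before) == template_shape(after) {
        TemplateChange::ParamRenameOnly
    } else {
        TemplateChange::ShapeChanged
    }
}
```

Under the current proposal, a rename-only result would map to POTENTIALLY BREAKING and a shape change to BREAKING.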
Implementation notes (non-binding)
- The diff engine lives in mcptest-core (new module mcptest_core::catalog::diff). The crate already owns the protocol layer and is the natural home for catalog logic.
- The cassette catalog reader lives in mcptest-cassette. It exposes Cassette::catalog() returning Catalog { tools, resources, ... }.
- The JSON Schema entry for the baseline catalog file lives in schemas/v1.json under a new top-level key (catalog_baseline).
- Reporter integration goes through mcptest-report. Each reporter picks up DiffSummary and renders its native format.
- Tests live as fixture-pair integration tests under tests/diff/<scenario>/. The first fixtures are: identical catalogs, one new tool, one removed tool, one breaking arg change, one description-only change.
References
- The cassette format (defines what the diff reads).
- The expected-failures baseline (the test-result baseline pattern; this spec is the catalog-baseline analogue).
- Cassette portability (cassettes recorded in one environment must be diffable in another).
- Cassette determinism normalization (the diff must not flag normalized fields as catalog changes).