
Cassettes

A cassette is a JSON file on disk that captures everything a test exchanged with its server (or with a model, for agent tests) on one specific run. The next time you run that test, mcptest reads the cassette instead of making the live calls. Same matchers, same assertions, zero network, zero cost.

Cassettes are the difference between "this test only runs when I have keys" and "this test runs in CI every commit." They are why mcptest's agent tests do not bankrupt anyone.

Why you should care

Three problems cassettes solve.

1. CI does not need your API keys

Without cassettes, an agent test that hits Claude costs money on every PR. With cassettes, the first author records once against the real provider, commits the cassette, and CI replays it forever. The CI worker never sees an API key and never hits the network.

The same applies to MCP servers that are slow, flaky, or expensive to spin up. Record once on a healthy connection; replay even when the upstream is down.

2. The test is the recording

When an assertion fails, you can look at the cassette and see exactly what the model said, what the tool returned, and what the matcher saw. "It worked on my machine but not in CI" stops being a debugging mystery, because the cassette is the same on both. If mcptest run fails in CI, run the same mcptest run locally against the committed cassette and you reproduce the failure on the first try.

3. New models surface immediately

The cassette pins the prompt, the system prompt, and the model id. If you bump claude-sonnet-4-5 to claude-sonnet-4-7 in YAML, the loader sees the model drift and refuses to replay the old cassette. You re-record once, see the new model's behavior in the diff, and decide whether to ship.

How the format works

There are three kinds of cassette in v1.0: snapshot cassettes for tool-test matchers, agent cassettes for model loops, and server cassettes for replaying an MCP server's JSON-RPC traffic.

Snapshot cassettes (tool tests)

Used by the snapshot: matcher. Each is a JSON capture of the value at one assertion's target. They live in a snapshots/ directory next to the suite YAML: a snapshot: "tools/echo/baseline" key resolves to <suite_dir>/snapshots/tools/echo/baseline.json (a .json extension is appended when the key has none, and intermediate directories are created on first write). The matcher records on the first run, deep-compares on later runs, and re-records under mcptest run --update-snapshots.
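The key-to-path rule above can be sketched in a few lines of Python. This is an illustration of the described behavior, not mcptest's own code; the function names snapshot_path and ensure_parent are hypothetical.

```python
from pathlib import Path


def snapshot_path(suite_dir: str, key: str) -> Path:
    """Resolve a snapshot: key to a file under <suite_dir>/snapshots/."""
    rel = Path(key)
    # Append .json only when the key carries no extension of its own.
    if rel.suffix == "":
        rel = rel.with_name(rel.name + ".json")
    return Path(suite_dir) / "snapshots" / rel


def ensure_parent(path: Path) -> None:
    # Intermediate directories are created on first write.
    path.parent.mkdir(parents=True, exist_ok=True)
```

With this sketch, the key "tools/echo/baseline" in a suite directory named suite resolves to suite/snapshots/tools/echo/baseline.json.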

Agent cassettes

One file per (agent test, model) pair. Layout:

cassettes/
  weather_query__claude-sonnet-4-5.json
  weather_query__gpt-5.json
  weather_query__gemini-2.5-pro.json

For models pulled in via a named providers: block, the path also carries the provider name so two providers serving the same model id do not collide:

cassettes/weather_query__openrouter__openai_gpt-4o.json
cassettes/weather_query__local-vllm__openai_gpt-4o.json
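To see how those file names could be derived, here is a minimal sketch. The sanitizing rule (whitespace and slashes become underscores) is an assumption inferred from the example paths above, and cassette_name is a hypothetical helper, not part of mcptest.

```python
import re
from typing import Optional


def slug(part: str) -> str:
    # Assumption: runs of whitespace and path separators become one underscore.
    return re.sub(r"[\s/\\]+", "_", part.strip())


def cassette_name(agent: str, model: str, provider: Optional[str] = None) -> str:
    """Build <agent>__<model>.json, or <agent>__<provider>__<model>.json."""
    parts = [slug(agent)]
    if provider is not None:
        # A named provider goes between agent and model to avoid collisions.
        parts.append(slug(provider))
    parts.append(slug(model))
    return "__".join(parts) + ".json"
```

Under these assumptions, the agent "weather query" with model "openai/gpt-4o" and provider "openrouter" maps to weather_query__openrouter__openai_gpt-4o.json.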

Each agent cassette file is a JSON document with this shape (schema at schemas/agent-cassette-v1.json):

{
  "schema_version": "v1",
  "agent_name": "weather query routes to get_weather",
  "model": "claude-sonnet-4-5",
  "user_prompt": "What is the weather in Sacramento?",
  "system_prompt": "You are a weather assistant. ...",
  "recorded_at": "2026-05-19T01:23:45Z",
  "trace": {
    "tool_calls": [
      {
        "name": "get_weather",
        "server": "weather",
        "args": { "city": "Sacramento" }
      }
    ],
    "tool_results": [
      {
        "call_index": 0,
        "server": "weather",
        "is_error": false,
        "result": { "content": [{ "type": "text", "text": "72 F, sunny" }] }
      }
    ],
    "final_response": "It is 72 F and sunny in Sacramento.",
    "conversation": {
      "tokens": { "prompt": 80, "completion": 25, "total": 105 },
      "duration_ms": 1234,
      "message_count": 4,
      "turns": []
    }
  }
}

The whole document is what your matchers resolve against during replay. Targets such as tool_calls[0].name, conversation.tokens.total, and final_response all walk this JSON.
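A rough sketch of how a target path might walk that JSON, in Python. It assumes targets are resolved relative to the trace object and use only dotted keys and [n] indexes; the resolve function is hypothetical and mcptest's real resolver may support more.

```python
import re
from typing import Any

# One token is either a bare key (no dots or brackets) or a [n] list index.
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def resolve(doc: Any, target: str) -> Any:
    """Walk a path such as tool_calls[0].name through nested dicts and lists."""
    current = doc
    for key, index in _TOKEN.findall(target):
        current = current[int(index)] if index else current[key]
    return current


# A trimmed copy of the trace from the cassette shown earlier.
trace = {
    "tool_calls": [{"name": "get_weather", "server": "weather"}],
    "final_response": "It is 72 F and sunny in Sacramento.",
    "conversation": {"tokens": {"prompt": 80, "completion": 25, "total": 105}},
}
```

Against this trimmed trace, resolve(trace, "tool_calls[0].name") yields "get_weather" and resolve(trace, "conversation.tokens.total") yields 105.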

Server cassettes (tool, resource, and prompt tests)

Agent cassettes record a model loop. A server cassette records the raw JSON-RPC traffic between mcptest and an MCP server, so a suite of tools:, resources:, and prompts: tests can run with no live server at all. You point a server at a cassette instead of a command or a URL:

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json

servers:
  recorded:
    cassette: ./cassettes/issues-server.json

tools:
  - name: "search returns a result"
    server: recorded
    tool: search
    args: { query: "login bug" }
    expect:
      - target: result.content[0].text
        matcher: { contains: "login" }

The cassette is a transport-agnostic JSON file: an ordered list of request/response exchanges, each a JSON-RPC envelope. The same file replays whether the server was originally reached over stdio or HTTP, because the recorded payload is the envelope, never the wire framing:

{
  "version": "1",
  "exchanges": [
    {
      "request":  { "jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {} },
      "response": { "jsonrpc": "2.0", "id": 1, "result": { "protocolVersion": "2025-03-26", "capabilities": {}, "serverInfo": { "name": "issues", "version": "1.0" } } }
    },
    {
      "request":  { "jsonrpc": "2.0", "id": 2, "method": "tools/call", "params": { "name": "search", "arguments": { "query": "login bug" } } },
      "response": { "jsonrpc": "2.0", "id": 2, "result": { "content": [ { "type": "text", "text": "issue #42: login bug" } ] } }
    }
  ]
}

Replay matches each request against the recorded exchanges in order. The initialize handshake matches on method alone, so a cassette survives an mcptest upgrade even though the client version it sends drifts. Other methods match on method plus arguments (volatile values like timestamps and UUIDs are normalized first). A request the cassette never recorded fails the test with a clear replay error instead of hanging, so a stale cassette is loud, not silent.
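The matching rules above can be sketched as follows. This is one possible reading, not mcptest's implementation: the regular expressions for volatile values, the choice to consume an exchange once it has matched, and the rewriting of the response id to the incoming request's id are all assumptions, and Replayer and ReplayError are hypothetical names.

```python
import json
import re
from typing import Any

UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)
TIMESTAMP_RE = re.compile(
    r"\b\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"
)


def normalize(value: Any) -> Any:
    """Replace volatile substrings so they do not defeat matching."""
    if isinstance(value, str):
        return TIMESTAMP_RE.sub("<timestamp>", UUID_RE.sub("<uuid>", value))
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    return value


def match_key(request: dict) -> str:
    method = request["method"]
    if method == "initialize":
        # The handshake matches on method alone; client info may drift.
        return method
    params = normalize(request.get("params", {}))
    return method + " " + json.dumps(params, sort_keys=True)


class ReplayError(Exception):
    """Raised when the cassette holds no exchange for a request."""


class Replayer:
    def __init__(self, cassette: dict) -> None:
        self._pending = list(cassette["exchanges"])

    def respond(self, request: dict) -> dict:
        wanted = match_key(request)
        # Scan in recorded order and consume the first match (assumption).
        for i, exchange in enumerate(self._pending):
            if match_key(exchange["request"]) == wanted:
                del self._pending[i]
                response = dict(exchange["response"])
                response["id"] = request.get("id")  # echo the caller's id
                return response
        raise ReplayError("no recorded exchange for " + wanted)
```

Failing fast with an exception, instead of waiting for a response that will never arrive, is what makes a stale cassette loud in this sketch.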

You can hand-author a cassette (the format above is the whole contract) or let mcptest capture one against a live server. Recorded cassettes are scrubbed of secrets before they hit disk, under the same redaction policy every reporter uses.

The record / replay workflow

The default mcptest run replays whatever cassettes exist. Recording is a separate, explicit step.

# Step 1: record. Set every key you have. Models with no matching key
# fall back to the deterministic stub provider and produce a "stub
# cassette" you would not normally commit.
ANTHROPIC_API_KEY=sk-ant-...    \
OPENAI_API_KEY=sk-...           \
GEMINI_API_KEY=AIza...          \
mcptest run --record

# Step 2: review the diff.
git status cassettes/
git diff cassettes/

# Step 3: commit the cassettes.
git add cassettes/
git commit -m "record agent baseline for weather query"

# Step 4: every subsequent run replays the cassettes. No key set, no
# network call to the provider, deterministic timing.
mcptest run

The reporter labels each run with what it actually did:

mcptest::weather query [claude-sonnet-4-5]   PASS  (4ms)   [replay]
mcptest::weather query [gpt-5]               PASS  (3ms)   [replay]
mcptest::weather query [gemini-2.5-pro]      PASS  (2ms)   [replay]

When the replay path is taken, the duration reflects how long it took mcptest to read the cassette and run the matchers, not how long the original model call took. The original wall-clock duration lives in conversation.duration_ms inside the cassette, so assertions about real latency still work.

When a cassette goes stale

The loader compares three fields against the YAML before agreeing to replay: model, user_prompt, and system_prompt. Any drift surfaces a clear error before the matchers run:

mcptest run: cassette for agent `weather query` model `gpt-5` is
stale on field `model` (cassette=`gpt-4o`, yaml=`gpt-5`); rerun with
--record

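Two ways out: rerun with --record so the cassette matches the YAML again, or revert the YAML change so it matches the committed cassette.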

The loader does not compare expect: blocks. Adding new assertions or loosening existing ones does not invalidate the cassette. The matchers just run against the recorded trace.
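The staleness check described in this section amounts to comparing three pinned fields and nothing else. A minimal Python sketch, assuming the cassette and the YAML test are both available as dicts with the same field names; stale_field is a hypothetical helper.

```python
from typing import Optional

# Only these fields are compared; expect: blocks are deliberately ignored.
PINNED_FIELDS = ("model", "user_prompt", "system_prompt")


def stale_field(cassette: dict, yaml_test: dict) -> Optional[str]:
    """Return the first pinned field that drifted, or None if replay is safe."""
    for field in PINNED_FIELDS:
        if cassette.get(field) != yaml_test.get(field):
            return field
    return None
```

Because expect: is not in the pinned set, adding or changing assertions leaves the result at None and the cassette stays usable.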

Adding a new model

The matrix workflow shines here. Imagine Anthropic releases claude-sonnet-4-7. You want to know whether your existing assertions hold under the new model before shipping the upgrade:

agents:
  - name: weather query
    models:
      - claude-sonnet-4-5    # existing, cassette already committed
      - claude-sonnet-4-7    # new, no cassette yet
      - gpt-5                # existing, cassette already committed
    ...

ANTHROPIC_API_KEY=sk-ant-... mcptest run --record

The runner replays the two cassettes that exist and records a new one for claude-sonnet-4-7. The report shows pass / fail per model. If the new cassette breaks an assertion, you know exactly which one and exactly what the model said. Commit (or do not commit) the new cassette based on what you find.

Cost and budget caps

Cassettes only help with replay cost; the record pass still calls the real provider. The budget: block at the top of the YAML stops a runaway record loop before it adds up:

budget:
  per_test_usd_cents: 50      # cap one agent test at 50 cents
  per_suite_usd_cents: 500    # cap the whole suite at 5 dollars

When the cap trips during recording, the agent loop terminates loudly and the cassette is not written. You can raise the cap and re-record, or tighten max_turns: so the loop is bounded by step count instead of dollars.
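As an illustration of how the two caps interact, here is a small Python sketch. How mcptest actually prices a model call is not covered here, so charge takes a cost in cents as given; BudgetGuard and BudgetExceeded are hypothetical names, and tripping only when spend goes strictly over the cap is an assumption.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a per-test or per-suite cap is crossed."""


class BudgetGuard:
    def __init__(self, per_test_usd_cents: float, per_suite_usd_cents: float) -> None:
        self.per_test = per_test_usd_cents
        self.per_suite = per_suite_usd_cents
        self.test_spent = 0.0
        self.suite_spent = 0.0

    def start_test(self) -> None:
        # The per-test counter resets; the suite counter keeps accumulating.
        self.test_spent = 0.0

    def charge(self, cents: float) -> None:
        self.test_spent += cents
        self.suite_spent += cents
        if self.test_spent > self.per_test:
            raise BudgetExceeded("per-test cap exceeded")
        if self.suite_spent > self.per_suite:
            raise BudgetExceeded("per-suite cap exceeded")
```

A recorder built this way would call charge after every model turn and skip writing the cassette if BudgetExceeded propagates.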

Key redaction

Anything that looks like a provider API key (sk-ant-..., sk-proj-..., sk-..., AIzaSy...) is replaced with a <...> placeholder before the cassette JSON lands on disk. The same redactor runs on every error the CLI emits, so a transport failure that includes the bearer header will not leak the key into the cassette or the CI log.

That said: do not paste keys into prompts or tool arguments. The redactor catches the obvious shapes but cannot read your mind.
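A pattern-based redactor for the key shapes listed above might look like the sketch below. The exact patterns, the minimum length of eight characters after the prefix, and the <REDACTED> placeholder text are assumptions chosen for illustration; mcptest's real rules and placeholder may differ.

```python
import re

# sk-... also covers sk-ant-... and sk-proj-...; AIzaSy... is handled separately.
# The leading \b keeps ordinary words such as "task-..." from matching.
KEY_RE = re.compile(r"\b(?:sk-[A-Za-z0-9_-]{8,}|AIzaSy[A-Za-z0-9_-]{8,})")

PLACEHOLDER = "<REDACTED>"  # hypothetical placeholder text


def redact(text: str) -> str:
    """Replace anything shaped like a provider API key with a placeholder."""
    return KEY_RE.sub(PLACEHOLDER, text)
```

Running the serialized cassette and every CLI error message through the same function is what would keep a leaked bearer header out of both the file and the CI log. A secret that does not match a known shape passes through untouched, which is why keys should stay out of prompts and tool arguments.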

Where cassettes live in the repo

Layout                   Cassette location
Suite at repo root       ./cassettes/<agent>__<model>.json
Suite in a subdirectory  <dir>/cassettes/<agent>__<model>.json
Snapshot matcher         path the user specified in the YAML

You commit them. The whole point is that the next contributor and CI both see the same recording. The .mcptest/ directory is for local, disposable state (the --last-failed list once that ships) and is gitignored by mcptest init's scaffold.
