Scenario 16: record and replay with cassettes

You have an agent test. It sends a real prompt to a real model, the model decides to call a tool on an MCP server, and you assert on what came back. That test is honest, but it costs tokens on every run and the model's wording drifts from one day to the next. You do not want to spend money or babysit flakiness on every pull request.

The fix is a cassette. You record the session once against the real model and the real server, commit the cassette JSON, and from then on mcptest run replays it. CI needs no provider key and makes no network call. The recorder and the replayer share one normalization pass, so timestamps, UUIDs, and request IDs do not show up as spurious diffs.

This walkthrough uses the hosted test server at https://test.mcptest.sh/mcp as the recordable target. It is deterministic, so the tool side of the recording is stable. The model side is what cassettes really buy you: an agent run with --record captures the model exchanges so CI can replay them offline.

Be precise about what each step needs. The record pass calls the real model, so it needs a key. The replay pass reads the cassette, so it needs nothing.

Record once

Save this as tests/forecast-agent.yml:

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json

servers:
  hosted:
    url: "https://test.mcptest.sh/mcp"

budget:
  per_suite_usd_cents: 100   # stop a runaway record loop at 1 dollar

agents:
  - name: forecast routes to get_forecast
    server: hosted
    model: claude-sonnet-4-5
    system_prompt: "You are a weather assistant. Use the tools provided."
    user_prompt: "What is the forecast for Sacramento?"
    max_turns: 4
    expect:
      - target: "tool_calls[0].name"
        matcher:
          exact: "get_forecast"
      - target: "final_response"
        matcher:
          contains: "Sacramento"

Run it once with a key set, passing --record:

ANTHROPIC_API_KEY=sk-ant-... mcptest run --config tests/forecast-agent.yml --record

--record sends every agent's model calls to the live provider and writes (or overwrites) the cassette on disk. The model call goes to Anthropic; the tool call goes to the hosted test server. Both halves of the exchange land in one agent cassette under a cassettes/ directory next to the suite, named for the (agent test, model) pair:

tests/
  forecast-agent.yml
  cassettes/
    forecast_routes_to_get_forecast__claude-sonnet-4-5.json
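
To make the naming concrete, here is a minimal Python sketch of how the file name above could be derived from the (agent test, model) pair. The rule is inferred from this one example only (spaces become underscores, then a double underscore, then the model id); cassette_filename is a hypothetical helper, not part of mcptest, and the real runner may treat other characters differently.

```python
def cassette_filename(test_name: str, model: str) -> str:
    """Guess the cassette file name for an (agent test, model) pair.

    Assumed rule, inferred from the single example in this walkthrough:
    spaces in the test name become underscores, followed by a double
    underscore and the model id.
    """
    slug = test_name.replace(" ", "_")
    return f"{slug}__{model}.json"


print(cassette_filename("forecast routes to get_forecast", "claude-sonnet-4-5"))
# forecast_routes_to_get_forecast__claude-sonnet-4-5.json
```

This is handy when you want to check in a script that a cassette exists for every agent entry before pushing.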

If you set no key, the runner falls back to the deterministic stub provider and writes a "stub cassette" you would not normally commit. The whole point of recording is to capture the real model, so set the real key.

Commit the cassettes

Review what was recorded, then commit it. The cassette is the artifact CI replays, so it belongs in version control next to the suite.

git status tests/cassettes/
git diff tests/cassettes/
git add tests/cassettes/forecast_routes_to_get_forecast__claude-sonnet-4-5.json
git commit -m "record forecast agent cassette against hosted test server"

Provider API keys are scrubbed before the cassette hits disk (anything shaped like sk-ant-..., sk-..., or AIzaSy... becomes a <...> placeholder), under the same redaction policy every reporter uses. That said, do not paste keys into prompts or tool arguments; the redactor catches the obvious shapes, not everything.
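
Because the redactor only catches the obvious shapes, you may want your own check before committing. The sketch below is one way to scan cassette text for key-like strings. The regular expressions are assumptions built from the prefixes named above (real keys may be longer or use other characters), and find_key_like is a hypothetical helper, not an mcptest API.

```python
import re

# Assumed key shapes, based only on the prefixes mentioned in this walkthrough.
# \b keeps ordinary words that merely contain "sk-" (such as "task-...") from matching.
KEY_PATTERNS = [
    re.compile(r"\bsk-ant-[A-Za-z0-9_\-]{8,}"),
    re.compile(r"\bsk-[A-Za-z0-9_\-]{8,}"),
    re.compile(r"\bAIzaSy[A-Za-z0-9_\-]{8,}"),
]


def find_key_like(text: str) -> list[str]:
    """Return the distinct key-shaped substrings found in text, sorted."""
    hits = set()
    for pattern in KEY_PATTERNS:
        hits.update(match.group(0) for match in pattern.finditer(text))
    return sorted(hits)


sample = '{"final_response": "Forecast for Sacramento", "note": "sk-ant-EXAMPLEEXAMPLE123"}'
print(find_key_like(sample))
# ['sk-ant-EXAMPLEEXAMPLE123']
```

In practice you would pass in the contents of each file under tests/cassettes/ and refuse to commit if the list is non-empty.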

What is happening here

The cassette is a JSON document holding the recorded trace: the tool calls the model made, the tool results the server returned, the final response, and the token / duration accounting. Your matchers (tool_calls[0].name, final_response) resolve against this JSON on replay exactly as they did against the live trace.
The cassette pins model, user_prompt, and system_prompt. If you later change any of those in the YAML, the loader refuses to replay the stale cassette and tells you to re-record. expect: blocks are not pinned, so you can add or loosen assertions without re-recording.
The normalization pass runs on both record and replay, so volatile values (timestamps, UUIDs, request IDs) are canonicalized and do not show up as diffs between the two passes.
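
To illustrate the normalization idea, here is a minimal Python sketch that canonicalizes UUIDs and ISO 8601 timestamps anywhere inside a JSON-like trace. It is not mcptest's implementation: the placeholder strings, the regular expressions, and the sample field names are all assumptions, and request IDs are left out because their shape is not described here.

```python
import re

# Assumed shapes for volatile values; the real normalizer may recognize more.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)
TIMESTAMP_RE = re.compile(
    r"\b\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"
)


def normalize(value):
    """Recursively replace volatile substrings with fixed placeholders."""
    if isinstance(value, dict):
        return {key: normalize(item) for key, item in value.items()}
    if isinstance(value, list):
        return [normalize(item) for item in value]
    if isinstance(value, str):
        value = UUID_RE.sub("<uuid>", value)            # hypothetical placeholder
        value = TIMESTAMP_RE.sub("<timestamp>", value)  # hypothetical placeholder
    return value


# Two traces that differ only in volatile values (field names are illustrative).
recorded = {
    "id": "3f2b8c1e-9d4a-4b6f-8a2e-1c5d7e9f0a1b",
    "at": "2024-05-01T12:00:00Z",
    "tool_calls": [{"name": "get_forecast"}],
}
replayed = {
    "id": "a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d",
    "at": "2025-01-15T08:30:45.123Z",
    "tool_calls": [{"name": "get_forecast"}],
}
print(normalize(recorded) == normalize(replayed))
# True
```

Running the same function on both sides is what makes the comparison stable: neither pass ever sees the raw volatile value.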

Replay in CI

The default mcptest run replays whatever cassettes exist. Recording is the only step you have to ask for; replay is plain mcptest run, with no --record flag and no key.

# .github/workflows/agent-tests.yml
name: agent tests
on: [push, pull_request]

jobs:
  agent:
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - uses: actions/checkout@v4
      - run: cargo install mcptest --locked
      - run: mcptest run --config tests/forecast-agent.yml
        # No ANTHROPIC_API_KEY in env. Replay needs no key and no network.

There is no provider key in the job env on purpose. Because the committed cassette exists, the runner reads it instead of calling the model, and the hosted test server is never contacted either. The test step stays offline, the run is deterministic, and every PR exercises the same recording the author committed.

If a teammate later bumps the model in the YAML (say claude-sonnet-4-5 to a newer id), the loader detects the drift and fails the run with a clear "stale on field model" error pointing them at --record. They re-record once, see the new model's behavior in the diff, and decide whether to ship.
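
The drift check can be pictured as a field-by-field comparison of the pinned values. The Python sketch below assumes the cassette stores model, user_prompt, and system_prompt as top-level keys under those names; that layout and the stale_fields helper are hypothetical, and only the list of pinned fields comes from this walkthrough.

```python
# Fields this walkthrough says the cassette pins.
PINNED_FIELDS = ("model", "user_prompt", "system_prompt")


def stale_fields(suite_entry: dict, cassette: dict) -> list[str]:
    """Return the pinned fields whose YAML value no longer matches the cassette."""
    return [
        field
        for field in PINNED_FIELDS
        if suite_entry.get(field) != cassette.get(field)
    ]


cassette = {
    "model": "claude-sonnet-4-5",
    "user_prompt": "What is the forecast for Sacramento?",
    "system_prompt": "You are a weather assistant. Use the tools provided.",
}
# A teammate bumps the model; "some-newer-model" is a made-up id.
suite_entry = dict(cassette, model="some-newer-model")

print(stale_fields(suite_entry, cassette))
# ['model']
```

Note that expect: blocks are deliberately absent from the comparison, which is why assertions can change without forcing a re-record.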

What is happening here

--record is the only pass that touches the live model and the live server. It needs a provider key. It overwrites the cassette on disk.
Plain mcptest run replays the committed cassette. It needs no key and makes no network call to the provider or to the test server.
One agent cassette captures both the model exchanges and the tool calls / results for that run, so replay reproduces the whole loop offline.
Cassettes live in a cassettes/ directory next to the suite YAML, one file per (agent test, model) pair. You commit them so the next contributor and CI both replay the same recording.
Replay durations measure how long mcptest took to read the cassette and run the matchers, not the original model latency. The real wall-clock time is preserved inside the cassette (conversation.duration_ms) so latency assertions still work.
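
Since matcher targets such as tool_calls[0].name and paths such as conversation.duration_ms are resolved against the cassette JSON, it can help to see how such a path might be walked. The Python sketch below is illustrative only: the resolve helper and the sample cassette layout are assumptions, and the real resolver may support more syntax.

```python
import re

# One token is either a key name or a [index]; dots between tokens are skipped.
TOKEN_RE = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def resolve(doc, target: str):
    """Walk a dotted/indexed path such as 'tool_calls[0].name' through doc."""
    current = doc
    for name, index in TOKEN_RE.findall(target):
        current = current[name] if name else current[int(index)]
    return current


# Hypothetical cassette fragment; only the paths themselves come from the text.
cassette = {
    "tool_calls": [{"name": "get_forecast"}],
    "final_response": "The forecast for Sacramento is sunny.",
    "conversation": {"duration_ms": 1400},
}

print(resolve(cassette, "tool_calls[0].name"))
# get_forecast
print(resolve(cassette, "conversation.duration_ms"))
# 1400
```

A latency assertion on replay would read the recorded conversation.duration_ms this way, rather than timing the replay itself.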

Expected output

The record run dispatches live and writes the cassette:

mcptest run --config tests/forecast-agent.yml --record

  REC   forecast routes to get_forecast [claude-sonnet-4-5]   (1.4s)
  PASS  forecast routes to get_forecast [claude-sonnet-4-5]   (1.4s)

recorded 1 exchange -> tests/cassettes/
1 passed, 0 failed in 1.5s

A later run, with no key and no network, replays it:

mcptest run --config tests/forecast-agent.yml

  PASS  forecast routes to get_forecast [claude-sonnet-4-5]   (4ms)   [replay]

1 passed, 0 failed in 0.0s

The [replay] label is the reporter telling you exactly what the run did: it read the cassette rather than calling the model. The 4ms covers reading the cassette and running the matchers, not the original model call.

Troubleshooting

The cassette did not record the real model. No key was set, so the runner used the deterministic stub and wrote a stub cassette. Set ANTHROPIC_API_KEY (or the matching provider key) and re-run with --record.
stale on field model (or user_prompt, system_prompt). The YAML drifted from the recorded cassette. Either revert the YAML edit if it was unintentional, or re-record with --record so the cassette catches up.
The budget cap tripped during recording. The agent loop terminated and the cassette was not written. Raise budget.per_suite_usd_cents, or bound the loop with max_turns:, then re-record.
CI fails but local passes (or vice versa). It should not, because both replay the same committed cassette. If it does, confirm the cassette is actually committed (git status tests/cassettes/) and that CI checked out the commit that contains it. The cassette is the recording; reproduce the CI failure locally by running the same mcptest run against the committed file.
You see network calls in CI. A cassette is missing for that (agent test, model) pair, so the runner tried to dispatch live and (with no key) fell back to the stub. Record the missing pair and commit it.