
Scenario 16: record and replay with cassettes

You have an agent test. It sends a real prompt to a real model, the model decides to call a tool on an MCP server, and you assert on what came back. That test is honest, but it costs tokens on every run and the model's wording drifts from one day to the next. You do not want to spend money or babysit flakiness on every pull request.

The fix is a cassette. You record the session once against the real model and the real server, commit the cassette JSON, and from then on mcptest run replays it. CI needs no provider key and makes no network call. The recorder and the replayer share one normalization pass, so timestamps, UUIDs, and request IDs do not show up as spurious diffs.
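
To make the normalization idea concrete, here is a minimal sketch of what such a pass could look like. It is not mcptest's implementation: the function name, the placeholder strings, the regular expressions, and the request_id key are all assumptions chosen for illustration.

```python
import re
from typing import Any

# Hypothetical patterns; the real normalizer's rules are not shown in this guide.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)
TIMESTAMP_RE = re.compile(
    r"\b\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"
)
VOLATILE_KEYS = {"request_id"}  # assumed field name


def normalize(value: Any) -> Any:
    """Replace volatile values with fixed placeholders, recursively."""
    if isinstance(value, dict):
        return {
            key: "<request-id>" if key in VOLATILE_KEYS else normalize(inner)
            for key, inner in value.items()
        }
    if isinstance(value, list):
        return [normalize(inner) for inner in value]
    if isinstance(value, str):
        value = UUID_RE.sub("<uuid>", value)
        return TIMESTAMP_RE.sub("<timestamp>", value)
    return value
```

Running the same function on both the recorded exchange and the replayed one is what keeps the two sides comparable: two payloads that differ only in a timestamp, a UUID, or a request ID normalize to the same value.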

This walkthrough uses the hosted test server at https://test.mcptest.sh/mcp as the recordable target. It is deterministic, so the tool side of the recording is stable. The model side is where cassettes pay off: an agent run with --record captures the model exchanges so CI can replay them offline.

Be precise about what each step needs. The record pass calls the real model, so it needs a key. The replay pass reads the cassette, so it needs nothing.

Record once

Save this as tests/forecast-agent.yml:

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json

servers:
  hosted:
    url: "https://test.mcptest.sh/mcp"

budget:
  per_suite_usd_cents: 100   # stop a runaway record loop at 1 dollar

agents:
  - name: forecast routes to get_forecast
    server: hosted
    model: claude-sonnet-4-5
    system_prompt: "You are a weather assistant. Use the tools provided."
    user_prompt: "What is the forecast for Sacramento?"
    max_turns: 4
    expect:
      - target: "tool_calls[0].name"
        matcher:
          exact: "get_forecast"
      - target: "final_response"
        matcher:
          contains: "Sacramento"

Run it once with a key set, passing --record:

ANTHROPIC_API_KEY=sk-ant-... mcptest run --config tests/forecast-agent.yml --record

--record sends every agent's model call to the live provider and writes (or overwrites) the cassette on disk. The model call goes to Anthropic; the tool call goes to the hosted test server. Both halves of the exchange land in one agent cassette under a cassettes/ directory next to the suite, named for the (agent test, model) pair:

tests/
  forecast-agent.yml
  cassettes/
    forecast_routes_to_get_forecast__claude-sonnet-4-5.json
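
If you want to predict where a cassette will land (for example, to check for it in a script), the naming can be sketched as below. The rule is inferred from the single example shown above, not from the runner's source: whitespace in the test name becomes underscores, and a double underscore separates the name from the model id. The helper names are hypothetical, and names containing other punctuation may be handled differently by the real tool.

```python
from pathlib import PurePosixPath


def cassette_filename(test_name: str, model: str) -> str:
    """Build a cassette file name from an (agent test, model) pair.

    Assumed rule, inferred from one example: collapse whitespace runs
    to single underscores, then join name and model with "__".
    """
    slug = "_".join(test_name.split())
    return f"{slug}__{model}.json"


def cassette_path(suite_path: str, test_name: str, model: str) -> str:
    """Place the cassette in a cassettes/ directory next to the suite file."""
    suite_dir = PurePosixPath(suite_path).parent
    return str(suite_dir / "cassettes" / cassette_filename(test_name, model))
```

For the suite in this walkthrough, cassette_path("tests/forecast-agent.yml", "forecast routes to get_forecast", "claude-sonnet-4-5") yields the path in the tree above.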

If you set no key, the runner falls back to the deterministic stub provider and writes a "stub cassette" you would not normally commit. The whole point of recording is to capture the real model, so set the real key.

Commit the cassettes

Review what was recorded, then commit it. The cassette is the artifact CI replays, so it belongs in version control next to the suite.

git status tests/cassettes/
git diff tests/cassettes/
git add tests/cassettes/forecast_routes_to_get_forecast__claude-sonnet-4-5.json
git commit -m "record forecast agent cassette against hosted test server"

Provider API keys are scrubbed before the cassette hits disk (anything shaped like sk-ant-..., sk-..., or AIzaSy... becomes a <...> placeholder), under the same redaction policy every reporter uses. That said, do not paste keys into prompts or tool arguments; the redactor catches the obvious shapes, not everything.
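
A shape-based redactor of this kind can be sketched in a few lines, which also shows why it cannot catch everything. This is an illustration, not the mcptest redactor: the regular expression, the minimum length of 8 trailing characters, and the "<redacted>" placeholder text are all assumptions.

```python
import re

# Assumed key shapes, built from the prefixes named above. The minimum of
# 8 trailing characters is a guess to avoid matching short ordinary tokens.
KEY_RE = re.compile(r"\b(?:sk-ant-|sk-|AIzaSy)[A-Za-z0-9_-]{8,}")

PLACEHOLDER = "<redacted>"  # hypothetical placeholder text


def scrub(text: str) -> str:
    """Replace anything shaped like a known provider key with a placeholder."""
    return KEY_RE.sub(PLACEHOLDER, text)
```

A secret with no recognizable prefix passes through scrub untouched, which is the reason to keep keys out of prompts and tool arguments in the first place.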

Replay in CI

The default mcptest run replays whatever cassettes exist. Recording is the only step that is explicit; replay is plain mcptest run, with no --record flag and no key.

# .github/workflows/agent-tests.yml
name: agent tests
on: [push, pull_request]

jobs:
  agent:
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - uses: actions/checkout@v4
      - run: cargo install mcptest --locked
      - run: mcptest run --config tests/forecast-agent.yml
        # No ANTHROPIC_API_KEY in env. Replay needs no key and no network.

The job env has no provider key on purpose. Because the committed cassette exists, the runner reads it instead of calling the model, and the hosted test server is never contacted either. The test step stays offline, the run is deterministic, and every PR exercises the same recording the author committed.

If a teammate later bumps the model in the YAML (say claude-sonnet-4-5 to a newer id), the loader detects the drift and fails the run with a clear "stale on field model" error pointing them at --record. They re-record once, see the new model's behavior in the diff, and decide whether to ship.
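
The drift check can be pictured as a field-by-field comparison between what the cassette recorded and what the suite now asks for. The sketch below is an assumption-heavy illustration: the cassette schema is not documented in this guide, so the dictionary layout, the function name, the exception class, and the exact message wording beyond "stale on field model" are hypothetical.

```python
class StaleCassetteError(Exception):
    """Raised when a cassette no longer matches the suite that uses it."""


def check_fresh(recorded: dict, suite: dict, fields: tuple = ("model",)) -> None:
    """Fail if any tracked field differs between cassette and suite.

    `recorded` and `suite` are assumed to be flat dictionaries holding
    the tracked fields; the real cassette layout may differ.
    """
    for field in fields:
        if recorded.get(field) != suite.get(field):
            raise StaleCassetteError(
                f"stale on field {field}: cassette has {recorded.get(field)!r}, "
                f"suite has {suite.get(field)!r}; re-run with --record"
            )
```

Failing loudly here, instead of silently replaying a recording made with a different model, is what keeps a green CI run meaningful.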

Expected output

The record run dispatches live and writes the cassette:

mcptest run --config tests/forecast-agent.yml --record

  REC   forecast routes to get_forecast [claude-sonnet-4-5]   (1.4s)
  PASS  forecast routes to get_forecast [claude-sonnet-4-5]   (1.4s)

recorded 1 exchange -> tests/cassettes/
1 passed, 0 failed in 1.5s

A later run, with no key and no network, replays it:

mcptest run --config tests/forecast-agent.yml

  PASS  forecast routes to get_forecast [claude-sonnet-4-5]   (4ms)   [replay]

1 passed, 0 failed in 0.0s

The [replay] label is the reporter telling you exactly what the run did: it read the cassette rather than calling the model. The 4ms is the matcher pass, not the original call.

Troubleshooting

See also