
Models and the compatibility matrix

mcptest's agent test type points a real model at a real MCP server and asserts against the resulting conversation: which tools the model called, what arguments it passed, the final text reply, and the conversation telemetry (tokens, duration, message count). A test you write once can be run against any number of models, so when a new release ships you keep the test, add the new identifier to models:, and the report shows, model by model, whether anything broke.

Provider families and env vars

| Family | Detected when model starts with | Env var(s) | Notes |
| --- | --- | --- | --- |
| Anthropic | claude- | ANTHROPIC_API_KEY | Tool use is first-class. |
| OpenAI | gpt-, chatgpt-, o<digit>, text-, davinci- | OPENAI_API_KEY (and optional OPENAI_ORG_ID) | Covers the gpt- and o-series reasoning models. |
| Google | gemini-, models/gemini- | GEMINI_API_KEY (falls back to GOOGLE_API_KEY) | AI Studio API. |
| Mistral | mistral-, codestral-, magistral-, ministral-, devstral-, pixtral-, open-mistral-, open-mixtral- | MISTRAL_API_KEY | La Plateforme API. |

When mcptest sees a model string it doesn't recognize, it runs that case against a deterministic stub instead of failing the suite. The reporter logs which provider was used or skipped per case, so you can see at a glance whether the matrix actually exercised the model you expected.

Adding a new family is a five-line patch to crates/mcptest-agent/src/provider.rs (predicate + env-var lookup + constructor). No CLI changes required.
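As a rough illustration of the detection rule in the table above, here is a minimal Python sketch. The prefixes and env var names come from the table; the function names, the return shape, and the reading of o<digit> as "the letter o followed by a digit" are assumptions, and the actual logic lives in the Rust file named above.

```python
import os
import re

# Prefixes and env vars copied from the families table; everything else
# (names, tuple layout) is hypothetical and not mcptest's real API.
FAMILIES = [
    ("anthropic", ("claude-",), ("ANTHROPIC_API_KEY",)),
    ("openai", ("gpt-", "chatgpt-", "text-", "davinci-"), ("OPENAI_API_KEY",)),
    ("google", ("gemini-", "models/gemini-"), ("GEMINI_API_KEY", "GOOGLE_API_KEY")),
    ("mistral", ("mistral-", "codestral-", "magistral-", "ministral-",
                 "devstral-", "pixtral-", "open-mistral-", "open-mixtral-"),
     ("MISTRAL_API_KEY",)),
]
O_SERIES = re.compile(r"o\d")  # assumed reading of the "o<digit>" rule, e.g. o3-mini


def detect_family(model):
    """Return the family name for a model id, or None if unrecognized."""
    for family, prefixes, _env_vars in FAMILIES:
        if model.startswith(prefixes):
            return family
    if O_SERIES.match(model):
        return "openai"
    return None


def resolve_key(family, env=os.environ):
    """Return the value of the first env var that is set for the family."""
    for name, _prefixes, env_vars in FAMILIES:
        if name == family:
            for var in env_vars:
                if env.get(var):
                    return env[var]
    return None
```

Under this sketch an id such as openai/gpt-4o is not auto-detected, because it does not start with any listed prefix.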

The matrix form

agents:
  - name: weather query routes to get_weather
    models:
      - claude-sonnet-4-5
      - claude-opus-4-7
      - gpt-5
      - gemini-2.5-pro
    servers: [weather]
    prompt: What is the weather in Sacramento?
    expect:
      - target: tool_calls[0].name
        matcher: { exact: get_weather }
      - target: tool_calls[0].args.city
        matcher: { regex: "(?i)sacramento" }
      - target: conversation.tokens.total
        matcher: { regex: "^[0-9]+$" }

What that prints (illustrative):

mcptest::weather query routes to get_weather [claude-sonnet-4-5]           PASS
mcptest::weather query routes to get_weather [claude-opus-4-7]             PASS
mcptest::weather query routes to get_weather [gpt-5]                       FAIL
        tool_calls[0].name: expected get_weather, got search
mcptest::weather query routes to get_weather [gemini-2.5-pro]              PASS

3 of 4 models passed.
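To make the expect entries in the suite above concrete, the following Python sketch resolves a target path against a trace and applies an exact or regex matcher. The trace shape is hypothetical, and two behaviors are assumptions rather than documented facts: values are stringified before matching (suggested by the regex on conversation.tokens.total), and regex matchers use search semantics.

```python
import re


def resolve(trace, target):
    """Walk a dotted path with optional [index] steps, e.g. tool_calls[0].args.city."""
    value = trace
    for part in target.split("."):
        m = re.fullmatch(r"(\w+)(?:\[(\d+)\])?", part)
        if m is None:
            raise ValueError(f"bad target segment: {part!r}")
        value = value[m.group(1)]
        if m.group(2) is not None:
            value = value[int(m.group(2))]
    return value


def check(trace, target, matcher):
    """Apply an {exact: ...} or {regex: ...} matcher to the resolved value."""
    actual = str(resolve(trace, target))  # assumption: compare as strings
    if "exact" in matcher:
        return actual == str(matcher["exact"])
    if "regex" in matcher:
        return re.search(matcher["regex"], actual) is not None
    raise ValueError(f"unknown matcher: {matcher!r}")


# Hypothetical trace for the failing gpt-5 row in the illustrative output.
trace = {"tool_calls": [{"name": "search", "args": {"city": "Sacramento"}}]}
print(check(trace, "tool_calls[0].name", {"exact": "get_weather"}))  # False
```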

model: <id> (singleton form) still works and is shorthand for models: [<id>]. Suites that already use model: keep running unchanged.

Record once, replay forever

Each (agent, model) pair gets its own cassette at cassettes/<agent_slug>__<model_slug>.json. The workflow:

# First record. Set every key you have; missing keys fall back to
# the stub so a partial record still produces a useful matrix.
ANTHROPIC_API_KEY=sk-ant-...  \
OPENAI_API_KEY=sk-proj-...    \
GEMINI_API_KEY=AIza...        \
  mcptest run --record

# Commit the cassettes. CI runs `mcptest run` (no flag) and replays
# them deterministically. No API key, no spend, exact same matchers.
git add cassettes/
git commit -m "record agent baseline"

Adding a new model to models: later does not invalidate the existing recordings. Run mcptest run --record again with the new provider's key set and only the new model's cassette is written.
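The incremental behavior can be pictured as a set difference between the (agent, model) pairs in the suite and the cassette files already on disk. The Python sketch below assumes a slug rule (lowercase, non-alphanumeric runs collapsed to "-"); the real slug rule is not stated here, and the helper names are hypothetical.

```python
import re
from pathlib import Path


def slug(text):
    # Assumed slug rule: lowercase, runs of non-alphanumerics become "-".
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def cassette_path(agent, model, root="cassettes"):
    """cassettes/<agent_slug>__<model_slug>.json, per the layout above."""
    return Path(root) / f"{slug(agent)}__{slug(model)}.json"


def models_to_record(agent, models, existing):
    """Return the models whose cassette is not in `existing` (a set of Paths)."""
    return [m for m in models if cassette_path(agent, m) not in existing]


agent = "weather query routes to get_weather"
on_disk = {cassette_path(agent, "claude-sonnet-4-5"), cassette_path(agent, "gpt-5")}
print(models_to_record(agent, ["claude-sonnet-4-5", "gpt-5", "gemini-2.5-pro"], on_disk))
```

Only gemini-2.5-pro is reported as missing in this example, which mirrors the "only the new model's cassette is written" behavior described above.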

When a release breaks something

The matrix is built for this case. Workflow:

  1. A new Claude / GPT / Gemini drops.
  2. Add the identifier to models: in your suite.
  3. With the relevant API key set, run mcptest run --record.
  4. mcptest writes a cassette for the new model only and shows you the first matcher that fails.
  5. Decide whether to fix your MCP server (it diverged), tighten the assertion (the old assertion was lucky), or document the regression.

If the new model also passes, commit the cassette and CI tracks the new baseline.

Cost guardrails

Both record runs and stub runs respect the per-test and per-suite budget knobs at the top of the YAML:

budget:
  per_test_usd_cents: 50
  per_suite_usd_cents: 200

Per-model fan-out multiplies the per-suite spend. If the matrix has four models and each costs around 1 cent per call, a single-call recording pass comes to about four cents, well inside the limits above. A runaway agent loop, by contrast, hits the budget and is terminated loudly instead of failing silently.
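The arithmetic and the runaway-loop case can be sketched in a few lines of Python. The Budget class and BudgetExceeded error are hypothetical helpers, and treating each (agent, model) case as one "test" for the per-test cap is an assumption; only the two cap names and their values come from the YAML above.

```python
class BudgetExceeded(RuntimeError):
    pass


class Budget:
    """Tracks spend in cents against per-test and per-suite caps (hypothetical)."""

    def __init__(self, per_test_usd_cents, per_suite_usd_cents):
        self.per_test = per_test_usd_cents
        self.per_suite = per_suite_usd_cents
        self.suite_spent = 0.0
        self.test_spent = {}

    def charge(self, test, cents):
        # Check before committing so the call that crosses a cap is the one rejected.
        test_total = self.test_spent.get(test, 0.0) + cents
        suite_total = self.suite_spent + cents
        if test_total > self.per_test:
            raise BudgetExceeded(f"{test}: {test_total} > {self.per_test} cents")
        if suite_total > self.per_suite:
            raise BudgetExceeded(f"suite: {suite_total} > {self.per_suite} cents")
        self.test_spent[test] = test_total
        self.suite_spent = suite_total


budget = Budget(per_test_usd_cents=50, per_suite_usd_cents=200)
for model in ["claude-sonnet-4-5", "claude-opus-4-7", "gpt-5", "gemini-2.5-pro"]:
    budget.charge(f"weather [{model}]", 1)  # one 1-cent call per model
print(budget.suite_spent)  # 4.0
```

A loop that keeps charging the same case would raise BudgetExceeded on the call that pushes it past 50 cents.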

Key redaction

The cassette writer scrubs anything that looks like a provider key (sk-ant-..., sk-proj-..., sk-..., AIzaSy...) out of the serialized JSON before it lands on disk. The same redaction runs on any error message the CLI emits, so a reqwest transport failure won't leak a key into a run record or a CI log.
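A redaction pass of this kind is typically a regex substitution over the serialized text. The Python sketch below is illustrative only: the patterns are guesses built from the key prefixes listed above, and the minimum length and placeholder string are assumptions, not the cassette writer's actual rules.

```python
import re

# Assumed patterns derived from the prefixes named above; the real writer may differ.
KEY_PATTERN = re.compile(
    r"\bsk-(?:ant-|proj-)?[A-Za-z0-9_\-]{8,}"  # sk-ant-..., sk-proj-..., sk-...
    r"|\bAIza[A-Za-z0-9_\-]{8,}"               # AIzaSy... style keys
)


def redact(text, placeholder="[REDACTED]"):
    """Replace anything that looks like a provider key with a placeholder."""
    return KEY_PATTERN.sub(placeholder, text)


print(redact("transport error: Authorization: Bearer sk-ant-abc123DEF456"))
# transport error: Authorization: Bearer [REDACTED]
```

The leading word boundary keeps ordinary words that merely contain "sk-" (for example "task-runner") from being scrubbed.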

Multi-server agent tests

Real agent workflows usually span more than one MCP server: open an issue and ping a channel, search a corpus and write a summary, schedule a meeting and email an invite. List every server the agent needs under servers: and the driver exposes the merged tool catalog to the model. Tool calls are routed back to the owning server and the trace records which server handled each call so suites can pin the routing.

servers:
  issues:
    command: ["./issues-server"]
  notifications:
    command: ["./notifications-server"]

agents:
  - name: open issue and notify oncall
    model: claude-sonnet-4-5
    servers: [issues, notifications]
    prompt: Open a P1 issue for the failing CI run and ping #oncall.
    expect:
      - target: tool_calls[0].server
        matcher: { exact: issues }
      - target: tool_calls[0].name
        matcher: { exact: create_issue }
      - target: tool_calls[1].server
        matcher: { exact: notifications }
      - target: tool_calls[1].name
        matcher: { exact: send_message }

The model sees each tool as <server>__<tool> so name collisions across servers cannot happen (issues__create_issue versus a hypothetical notifications__create_issue). Tool calls in the trace keep the bare tool name and gain a server field so the YAML target grammar stays clean. Single-server runs keep the bare tool name on the wire too, so existing one-server suites are unaffected.
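The naming scheme amounts to a prefix on the way out and a split on the way back. This Python sketch shows one way to do that; the function names are hypothetical, and two details are assumptions: the split happens at the first "__", and single-server traces also carry a server field.

```python
def wire_name(server, tool, server_count):
    """Name shown to the model: bare for one server, <server>__<tool> otherwise."""
    return tool if server_count == 1 else f"{server}__{tool}"


def trace_entry(wire, servers):
    """Map a wire name back to the trace's bare tool name plus a server field."""
    if len(servers) == 1:
        return {"server": servers[0], "name": wire}
    # Assumption: split at the first "__", so server names must not contain it.
    server, sep, tool = wire.partition("__")
    if not sep or server not in servers:
        raise ValueError(f"cannot route tool call: {wire!r}")
    return {"server": server, "name": tool}


servers = ["issues", "notifications"]
print(wire_name("issues", "create_issue", len(servers)))   # issues__create_issue
print(trace_entry("notifications__send_message", servers))
```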

A worked example lives at examples/agent-issues-and-notifications.yml.

Custom OpenAI-compatible endpoints

When you need to target an endpoint the auto-detect cannot identify (Azure OpenAI, OpenRouter, vLLM, llama.cpp server, LiteLLM, Together, Groq, Anyscale, Fireworks, Anthropic via Bedrock, etc.), declare it under a top-level providers: block and reference it from models:, as in this suite:

providers:
  openrouter:
    type: openai            # the only supported wire protocol today
    base_url: https://openrouter.ai/api/v1
    api_key_env: OPENROUTER_API_KEY
  azure-prod:
    type: openai
    base_url: https://my-resource.openai.azure.com/openai/deployments/my-gpt-5
    api_key_env: AZURE_OPENAI_KEY
    organization: my-azure-org   # optional, sent as OpenAI-Organization
  local-vllm:
    type: openai
    base_url: http://localhost:8000/v1
    # api_key_env omitted -> the runner sends `Authorization: Bearer EMPTY`
    # which the common self-hosted servers accept.

agents:
  - name: weather query routes to get_weather
    models:
      - claude-sonnet-4-5                              # auto-detect (uses ANTHROPIC_API_KEY)
      - { provider: openrouter, id: openai/gpt-4o }             # named provider
      - { provider: openrouter, id: anthropic/claude-3.5-sonnet }
      - { provider: azure-prod, id: my-gpt-5 }
      - { provider: local-vllm, id: meta-llama/Llama-3.1-70B-Instruct }
    servers: [weather]
    prompt: What is the weather in Sacramento?
    expect:
      - target: tool_calls[0].name
        matcher: { exact: get_weather }

The reporter labels rows that came from a named provider as [provider/id] (so weather query [openrouter/openai/gpt-4o]), keeping auto-detect rows at [id]. Cassettes for named entries land at cassettes/<agent>__<provider>__<model>.json so the same model id served by two providers stays isolated on disk.
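The labeling and cassette naming rules for the two kinds of models: entries can be sketched in Python as follows. The label format matches the example above; the slug rule used for file names is an assumption (ids such as openai/gpt-4o contain "/", so some normalization is needed), and the helper names are hypothetical.

```python
import re


def slug(text):
    # Assumed slug rule: lowercase, runs of non-alphanumerics become "-".
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def row_label(agent, model):
    """`model` is a bare id string or a {provider, id} mapping, as in models:."""
    if isinstance(model, str):
        return f"{agent} [{model}]"
    return f"{agent} [{model['provider']}/{model['id']}]"


def cassette_name(agent, model):
    """<agent>__<model>.json for auto-detect, <agent>__<provider>__<model>.json for named."""
    if isinstance(model, str):
        parts = [agent, model]
    else:
        parts = [agent, model["provider"], model["id"]]
    return "__".join(slug(p) for p in parts) + ".json"


named = {"provider": "openrouter", "id": "openai/gpt-4o"}
print(row_label("weather query", named))      # weather query [openrouter/openai/gpt-4o]
print(cassette_name("weather query", named))  # weather-query__openrouter__openai-gpt-4o.json
```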

When to use which

Adding a model the CLI doesn't recognize yet

Two paths:

  1. Quickest: declare a providers: entry pointed at the right endpoint and reference it from models:. No code change required.
  2. Cleanest: add a five-line predicate + env-var lookup in crates/mcptest-agent/src/provider.rs so the family is auto-detected by future suites.

Either works; the first is the right move when you're trying things out and the second is the right move when a family is going to be a permanent fixture.

For Ollama specifically, point a providers: entry at http://localhost:11434/v1 (Ollama's OpenAI-compatible shim) and reference your model ids from there. The native Ollama provider lives in mcptest-core for callers that want streaming, but the OpenAI-compatible shim is enough for agent tests today.