
Scenario 17: one agent test across a model matrix

How this page relates to the model compatibility guide: scenarios are five-minute runnable walkthroughs; guides are the full reference. This page fans one agent test across a models: list and reads the resulting grid. For the baseline/candidate/diff rollout workflow (mcptest model-compat), the classification model, and the baseline file format, read the guide.

Your server works with the model you developed against. Then a customer connects a different one, and a tool that fired reliably stops firing on a phrasing the new model parses differently. Nothing in your server changed; the verdict still flipped. The cheapest way to catch this class of breakage is to take one agent test you already trust and run it against every model you care about, in one invocation, with one report.

Five minutes, three steps: add a models: list to an existing agent test, run the sweep, read the matrix.

Step 1: start from an agent test you already have

Any agent test works. This one asks for the weather and asserts that the model actually called the tool with the right argument. It is the first-test pattern plus an agent block:

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json

servers:
  weather:
    command: ["./examples/reference-server/weather.sh"]

budget:
  per_test_usd_cents: 50
  per_suite_usd_cents: 500

agents:
  - name: weather query routes to get_weather
    model: claude-sonnet-4-5
    servers: [weather]
    prompt: What is the weather in Sacramento?
    max_turns: 3
    expect:
      - target: tool_calls[0].name
        matcher: { exact: get_weather }
      - target: tool_calls[0].args.city
        matcher: { icontains: sacramento }

Step 2: fan it across models

Swap the single model: for a models: list. The runner dispatches one cell per model; everything else about the test stays the same:

agents:
  - name: weather query routes to get_weather
    models:
      - claude-sonnet-4-5     # ANTHROPIC_API_KEY
      - gpt-5                 # OPENAI_API_KEY
      - gemini-2.5-pro        # GEMINI_API_KEY or GOOGLE_API_KEY
    servers: [weather]
    prompt: What is the weather in Sacramento?
    max_turns: 3
    expect:
      - target: tool_calls[0].name
        matcher: { exact: get_weather }
      - target: tool_calls[0].args.city
        matcher: { icontains: sacramento }

Bare ids are provider-auto-detected from their prefix (claude-*, gpt-*, gemini-*, ...). Set the provider key for each model you want to exercise.
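
Before spending budget on a sweep, it can help to confirm that the keys are actually exported. The following is a minimal sketch, not part of mcptest: the function name missing_provider_keys is hypothetical, and the variable names are taken from the comments in the models: list above. It prints the names of any missing keys and returns nonzero if at least one is absent.

```shell
# Hypothetical preflight helper (not an mcptest command): report which
# provider keys are unset for the three models used in this scenario.
missing_provider_keys() {
  missing=""
  [ -n "${ANTHROPIC_API_KEY:-}" ] || missing="$missing ANTHROPIC_API_KEY"
  [ -n "${OPENAI_API_KEY:-}" ] || missing="$missing OPENAI_API_KEY"
  # Gemini accepts either variable, per the comment in the models: list.
  [ -n "${GEMINI_API_KEY:-}${GOOGLE_API_KEY:-}" ] || missing="$missing GEMINI_API_KEY/GOOGLE_API_KEY"
  # Print the names without the leading space; empty output means all are set.
  printf '%s\n' "${missing# }"
  # Succeed only when nothing is missing.
  [ -z "$missing" ]
}
```

Called as `missing_provider_keys || echo "export the keys listed above first" >&2`, it lets a script stop early instead of discovering a missing key partway through the matrix.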

You can also skip the YAML edit entirely. The --models flag fans every agent test in the suite across a comma-separated list and overrides any models: declared in the file:

mcptest run suite.yml --models claude-sonnet-4-5,gpt-5,gemini-2.5-pro

Step 3: read the matrix

With --models, the run defaults to the matrix reporter: one row per test, one column per model, and a pass/fail cell with the score at each intersection. To write a file you can open or attach to a PR, add --output:

mcptest run suite.yml \
  --models claude-sonnet-4-5,gpt-5,gemini-2.5-pro \
  --reporter matrix --output matrix.html

The grid shows weather query routes to get_weather as a row and the three models as columns. A green cell means that model passed every assertion. A red cell expands into a "why" drill-down that lists the failing assertion rows (for example, tool_calls[0].name resolved to nothing because the model answered from memory instead of calling the tool). The summary row gives the per-model pass rate. --reporter matrix-md writes the same grid as a GitHub-flavored Markdown table.

The exit code follows the usual rule: the run fails if any cell fails, so a green run means every model passed the test. Tool, resource, and prompt tests carry no model dimension; they run once and stay out of the grid.
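
In CI, the exit-code rule above is all a gate needs. The wrapper below is a sketch under that single assumption (nonzero exit when the run fails); the function name gate_on_matrix and its messages are illustrative and not part of mcptest. It runs whatever command it is given and passes the exit status through.

```shell
# Hypothetical CI wrapper: run the given command and turn its exit status
# into a one-line verdict. Only the "run fails if any cell fails" rule
# comes from this page; everything else here is illustrative.
gate_on_matrix() {
  if "$@"; then
    echo "matrix green: every model passed"
  else
    status=$?  # exit status of the command that was just run
    echo "matrix run failed (exit $status)" >&2
    return "$status"
  fi
}

# Usage, with the command from Step 3:
# gate_on_matrix mcptest run suite.yml --models claude-sonnet-4-5,gpt-5,gemini-2.5-pro
```

Passing the original status through, instead of collapsing it to 1, keeps whatever meaning the runner attaches to its exit codes visible to the CI system.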

Where to go from here

A passing matrix today does not prove that the same tests will still pass after next month's silent model update. To gate a rollout on "the new model behaves like the old one", capture a baseline with mcptest model-compat capture and diff a candidate against it. That workflow, with its PASS / DRIFT / FAIL classification, is covered in the model compatibility guide.
