
Comparison-matrix reporter

The matrix reporter renders a run as a grid: one row per test case and one column per model (or prompt). Each intersection holds a pass/fail cell with its score, and failing rows offer a drill-down. It is the file-based equivalent of the side-by-side comparison view, written as a single self-contained HTML file (or a Markdown variant) that needs no server.

Use it when a run fans an agent test across a models: matrix and you want to see, at a glance, which model wins each test and where the failures cluster.

When the grid is populated

A matrix grid needs matrix data. A run produces it when an agent test declares a model matrix (models:), so the runner dispatches one cell per model. That matrix is carried in the report's matrix section, so the live run and a re-render from a saved report show the same grid:

# Render at run time.
mcptest run suite.yml --reporter matrix --output matrix.html

# Or re-render a saved report later (same artifact).
mcptest run suite.yml --reporter json --output run.json
mcptest report run.json --format matrix --output matrix.html

When a report carries no matrix section (an ordinary, non-matrix run) the grid degrades to a single result column with one row per test, so the reporter is always well-defined.
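The fallback can be pictured with a small sketch. This is not mcptest's implementation: the report shape (a `results` list with `test`, `model`, and `passed` keys, and a `matrix` section holding `models`) is assumed purely for illustration.

```python
from typing import Any


def build_grid(report: dict[str, Any]) -> tuple[list[str], dict[str, dict[str, bool]]]:
    """Return (columns, rows) for a report. The report shape is hypothetical."""
    matrix = report.get("matrix")
    # With a matrix section: one column per model. Without: one result column.
    columns = list(matrix["models"]) if matrix else ["result"]
    rows: dict[str, dict[str, bool]] = {}
    for result in report["results"]:
        if matrix:
            column = result.get("model")
            if column is None:
                continue  # no model dimension, so not part of the model grid
        else:
            column = "result"
        rows.setdefault(result["test"], {})[column] = result["passed"]
    return columns, rows


plain = {"results": [{"test": "a", "passed": True}, {"test": "b", "passed": False}]}
print(build_grid(plain))
```

Either way the function returns a list of columns and one row per test, which is what keeps the reporter well-defined for non-matrix runs.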

Sweeping models from the CLI

You do not have to edit the YAML to compare models. --models runs the whole suite as a model matrix in one invocation: every agent test fans across the comma-separated list (overriding any models: it declared), one cell per model, and the run defaults to the matrix reporter so the output is the grid.

mcptest run suite.yml --models gpt-4o,claude-3-5-sonnet,gemini-2.0-flash

Bare ids are provider-auto-detected from their prefix (gpt-*, claude-*, gemini-*, ...). Pass an explicit --reporter (and --output) to override the default grid output. The exit code follows the usual rule: the run fails if any cell fails, so a green run means every model passed every agent test. Tool, resource, and prompt tests carry no model dimension, so they run once and are not part of the model grid.

Formats

The HTML and Markdown variants are both file/stdout formats like every other reporter: pick the sink with --output (a path) or let it stream to stdout.

Example

agents:
  - name: summarize
    server: docs
    prompt: "Summarize the attached document in three bullets."
    models: ["gpt-4o", "claude-3-5-sonnet"]
    expect:
      - matcher:
          factuality:
            reference: "<the document>"

mcptest run suite.yml --reporter matrix --output matrix.html
open matrix.html

The grid shows summarize as a row, gpt-4o and claude-3-5-sonnet as columns, a pass/fail cell for each, the per-model pass rate along the bottom, and the per-test pass rate down the right. Secrets are redacted from labels and drill-downs by the same redactor every other reporter uses.
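The two margins and the exit rule can be sketched as follows. The `rows` mapping (test name to per-model pass flags) and the 0/1 exit values are assumptions for this sketch, not mcptest's data model.

```python
def pass_rate(cells) -> float:
    """Fraction of passing cells; 0.0 for an empty sequence."""
    cells = list(cells)
    return sum(cells) / len(cells) if cells else 0.0


def margins(rows: dict[str, dict[str, bool]], columns: list[str]):
    """Return (per-model rates for the bottom row, per-test rates for the right column)."""
    per_model = {
        col: pass_rate(row[col] for row in rows.values() if col in row)
        for col in columns
    }
    per_test = {test: pass_rate(row.values()) for test, row in rows.items()}
    return per_model, per_test


def exit_code(rows: dict[str, dict[str, bool]]) -> int:
    """Non-zero if any cell failed, mirroring the 'any cell fails' rule."""
    return 0 if all(all(row.values()) for row in rows.values()) else 1


rows = {"summarize": {"gpt-4o": True, "claude-3-5-sonnet": False}}
print(margins(rows, ["gpt-4o", "claude-3-5-sonnet"]), exit_code(rows))
```

With one passing and one failing cell, the per-test rate for summarize is 0.5 and the run as a whole counts as failed.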