Reporters

A reporter turns a run into output. mcptest run --reporter <FORMAT> picks the format and --output <PATH> picks the sink (a file, or stdout when omitted). The default is pretty.

One run can be re-rendered into other formats afterward. Capture json once, then mcptest report <run.json> --format <fmt> writes a second format from the same saved run without re-executing the suite. Every format except minimal, which is run-only, can be produced this way:

mcptest run --reporter json --output run.json
mcptest report run.json --format junit --output junit.xml
mcptest report run.json --format sarif --output mcptest.sarif

The redaction policy is re-applied when the saved run is dispatched to the target reporter, so secrets sealed by the JSON reporter stay sealed when another format emits from the saved run.

Format table

Format	Sink	Use it for
`pretty`	text (default)	Interactive shells and local development. One line per test plus a summary, failure detail inline.
`minimal`	text	A compact one-line summary (`ran N tests: ...`) on stdout, a `FAIL` line per failure on stderr. Terse CI logs. Run-only (not available on `mcptest report`).
`json`	file/stdout	The full run record. A `json` file is the canonical artifact `mcptest report` re-renders any other format from.
`junit`	file	A CI test reporter (GitHub Actions, GitLab, CircleCI Insights). One `<testsuite>` per server, one `<testcase>` per test.
`md`	file/stdout	A GitHub-flavored Markdown summary for a PR comment or job summary.
`html`	file	A single-file HTML report with inline CSS.
`sarif`	file	SARIF 2.1.0 for GitHub code scanning and similar finding consumers. See SARIF below.
`gitlab`	file	GitLab Code Quality JSON for the merge request widget. See GitLab Code Quality below.
`ndjson`	file/stdout	Newline-delimited JSON: one `test` record per line, then a `summary`. For log pipelines and `jq -c`.
`tap`	file/stdout	Test Anything Protocol v14, for `prove`/`tappy`-style consumers.
`matrix`	file	A self-contained HTML test-by-model comparison grid. The default output of a `--models` sweep. See Comparison matrix and model sweeps below.
`matrix-md`	file/stdout	The comparison grid as GitHub-flavored Markdown.
`agent`	text/file	Compact, model-facing output for a coding agent reading the result in a loop. See agent interface.
`openinference`	file/stdout	OpenInference/OTLP-compatible trace spans as JSONL, for eval and observability platforms. See OpenTelemetry exports.
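
The ndjson stream can be consumed one line at a time. The following is a minimal Python sketch, not the documented schema: it assumes each line is a JSON object with a `type` key set to `test` or `summary`, and that test records carry `name` and `status`. Those key names are hypothetical, so check them against real output before relying on them.

```python
import io
import json


def split_ndjson(stream):
    """Split an NDJSON stream into (test_records, summary).

    Assumes a hypothetical "type" discriminator on every record.
    """
    tests, summary = [], None
    for raw in stream:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record.get("type") == "summary":
            summary = record
        else:
            tests.append(record)
    return tests, summary


# Hypothetical sample data, not real mcptest output.
sample = io.StringIO(
    '{"type": "test", "name": "lists tools", "status": "passed"}\n'
    '{"type": "test", "name": "rejects bad input", "status": "failed"}\n'
    '{"type": "summary", "passed": 1, "failed": 1}\n'
)
tests, summary = split_ndjson(sample)
failed = [t["name"] for t in tests if t["status"] == "failed"]
print(failed)
```

Because every line is a complete JSON value, the same loop works on a file handle or on a pipe without loading the whole run into memory.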

A pretty, md, or html report can also carry a compliance score delta when the run is gated against a baseline; see compliance score delta.

SARIF

mcptest emits SARIF, the Static Analysis Results Interchange Format (v2.1.0), so the same run that proves your MCP server works can also surface findings in any tool that already speaks SARIF: GitHub code scanning, GitLab Ultimate, Azure DevOps Advanced Security, Sonar, Semgrep, and the VS Code SARIF Viewer.

Use SARIF when the consumer cares about findings, not test counts. A typical SARIF consumer renders each result inline on a pull request as a code annotation pointing at the YAML line that authored the test, and groups results by rule across runs. That is a better fit for the compliance suite than JUnit, which models results as test cases. If your CI consumer wants test totals (X passed, Y failed), keep JUnit. If it wants a security-style finding stream, use SARIF.

mcptest run --config examples/reference-server/tests/smoke.yml --reporter json --output run.json
mcptest report run.json --format sarif --output mcptest.sarif

Every failing test becomes one result entry:

SARIF field	Source
`ruleId`	`TestResult::rule_id` when set (PROTO-001, EDGE-005, ...). Falls back to `mcptest.assertion.failed`.
`level`	MUST -> `error`, SHOULD -> `warning`, MAY -> `note`. Failure with no rule defaults to `error`.
`message.text`	`<test name>: <failure message>`.
`locations[].physicalLocation.artifactLocation.uri`	`TestResult::file` (path to the YAML that authored the test).
`locations[].physicalLocation.region.startLine`	`TestResult::line`.
`properties.duration_ms`, `properties.cache_hit`, `properties.started_at`, `properties.ended_at`, `properties.severity`	Per-test telemetry. SARIF consumers ignore unknown properties.

The run object carries tool.driver.{name, version, informationUri, rules}. Every distinct rule_id seen in the run lands in the rules array with a shortDescription, fullDescription, helpUri (https://mcptest.sh/compliance/<RULE-ID>), and defaultConfiguration.level matching the RFC 2119 severity. Run-level timing lives under runs[0].properties.timing (total_duration_ms, cache_hits, started_at, ended_at).

Passing tests are not emitted as result entries because SARIF models a defect stream. Use the JSON reporter or JUnit reporter when you need the full pass list.
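
The field mapping in the table above can be sketched in Python. This is an illustration under stated assumptions, not mcptest's implementation: the input is a plain dict whose keys (`name`, `message`, `rule_id`, `severity`, `file`, `line`, `duration_ms`) are hypothetical stand-ins for the `TestResult` fields, and only the SARIF side follows the 2.1.0 result shape.

```python
import json

# RFC 2119 severity -> SARIF level, per the table above.
LEVELS = {"MUST": "error", "SHOULD": "warning", "MAY": "note"}


def to_sarif_result(test):
    """Map one failing test (dict with hypothetical keys) to a SARIF result."""
    rule_id = test.get("rule_id") or "mcptest.assertion.failed"
    # A failure with no rule carries no severity and defaults to "error".
    level = LEVELS.get(test.get("severity"), "error")
    return {
        "ruleId": rule_id,
        "level": level,
        "message": {"text": f"{test['name']}: {test['message']}"},
        "locations": [
            {
                "physicalLocation": {
                    "artifactLocation": {"uri": test["file"]},
                    "region": {"startLine": test["line"]},
                }
            }
        ],
        "properties": {"duration_ms": test.get("duration_ms")},
    }


# Hypothetical failing test, for illustration only.
failing = {
    "name": "initialize returns protocolVersion",
    "message": "missing field protocolVersion",
    "rule_id": "PROTO-001",
    "severity": "MUST",
    "file": "tests/mcp.yaml",
    "line": 12,
    "duration_ms": 41,
}
result = to_sarif_result(failing)
print(json.dumps(result, indent=2))
```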

GitHub code scanning

name: mcptest

on:
  pull_request:
  push:
    branches: [main]

jobs:
  mcptest:
    runs-on: ubuntu-latest
    permissions:
      security-events: write   # required for upload-sarif
      contents: read
    steps:
      - uses: actions/checkout@v4

      - name: Install mcptest
        run: curl -sSL https://download.mcptest.sh/install.sh | sh

      - name: Run mcptest with SARIF output
        run: |
          mcptest run \
            --config tests/mcp.yaml \
            --reporter sarif \
            --output mcptest.sarif
        continue-on-error: true   # let the SARIF upload run even on failure

      - name: Upload SARIF to GitHub code scanning
        if: always()
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: mcptest.sarif
          category: mcptest

GitHub displays each finding on the "Files changed" tab of the pull request, with the ruleId (for example PROTO-001) and a click-through to https://mcptest.sh/compliance/PROTO-001. The Security tab aggregates findings by rule so a regression spike on EDGE-* rules surfaces at a glance.
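
The same per-rule grouping can be reproduced locally before uploading. The Python sketch below relies only on the standard SARIF 2.1.0 layout (`runs[].results[].ruleId`); the sample document is hypothetical and stands in for a parsed `mcptest.sarif`.

```python
from collections import Counter


def findings_by_rule(sarif):
    """Count SARIF results per ruleId across all runs in a parsed document."""
    counts = Counter()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            counts[result.get("ruleId", "<no rule>")] += 1
    return counts


# Hypothetical parsed SARIF document, trimmed to the fields used here.
sample_sarif = {
    "version": "2.1.0",
    "runs": [
        {
            "results": [
                {"ruleId": "PROTO-001", "level": "error"},
                {"ruleId": "EDGE-005", "level": "warning"},
                {"ruleId": "EDGE-005", "level": "warning"},
            ]
        }
    ],
}
counts = findings_by_rule(sample_sarif)
for rule_id, n in counts.most_common():
    print(rule_id, n)
```

To run it against a real file, load the document with `json.load` and pass the resulting dict to `findings_by_rule`.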

GitLab Code Quality

mcptest emits a JSON array compatible with GitLab Code Quality so failures surface as inline annotations on the merge request "Changes" tab. The same array can be consumed by any tool that speaks the GitLab Code Quality format (CodeClimate engines, GitLab CI custom widgets).

mcptest run --config examples/reference-server/tests/smoke.yml --reporter json --output run.json
mcptest report run.json --format gitlab --output gl-code-quality-report.json

Every failing test becomes one finding. Skipped tests render as minor findings so reviewers still see them without flipping the pipeline status. Passing tests are omitted because the format models a defect stream.

Field	Source
`description`	`<test name>: <failure message>`.
`check_name`	`TestResult::rule_id` when set (PROTO-001, EDGE-005, ...). Falls back to `mcptest.assertion`.
`fingerprint`	SHA-256 hex of `rule_id + file + line` (or `name + file + line` when no rule is attached). The failure message is never hashed, and the test name is hashed only in the no-rule fallback, so cosmetic edits do not churn fingerprints across runs.
`severity`	MUST -> `critical`, SHOULD -> `major`, MAY -> `minor`. Test failure with no rule attached -> `major`. Skipped tests -> `minor`.
`location.path`	`TestResult::file`.
`location.lines.begin`	`TestResult::line`.
`properties.duration_ms`, `properties.cache_hit`	Per-test telemetry. GitLab ignores unknown keys.

GitLab orders severities info < minor < major < critical < blocker. Mapping RFC 2119 severity directly to that ladder keeps the merge request badge meaningful: a MUST regression is critical (block the merge), a SHOULD regression is major (review required), and a MAY regression is minor (informational). When the test is not a compliance check, the safe default is major so reviewers still see the finding.

Fingerprints stay stable across reruns of the same test even when the failure message changes. Two distinct rules at the same location still produce different fingerprints.
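
The fingerprint and severity rules can be sketched in Python with `hashlib`. This is a sketch under assumptions, not the actual implementation: the dict keys are hypothetical, and joining the three parts with a newline is a guess at the byte layout, so the digests below will not necessarily match the ones mcptest writes.

```python
import hashlib

# RFC 2119 severity -> GitLab Code Quality severity, per the table above.
SEVERITIES = {"MUST": "critical", "SHOULD": "major", "MAY": "minor"}


def fingerprint(test):
    """SHA-256 hex over rule_id (or name when no rule), file, and line."""
    key = test.get("rule_id") or test["name"]
    # Assumption: parts are joined with "\n"; the real separator may differ.
    payload = "\n".join([key, test["file"], str(test["line"])])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def severity(test):
    """Skipped tests are minor; failures with no rule default to major."""
    if test.get("skipped"):
        return "minor"
    return SEVERITIES.get(test.get("severity"), "major")


# Hypothetical findings: same rule and location, different failure message.
first = {"name": "ping", "rule_id": "PROTO-001", "severity": "MUST",
         "file": "tests/mcp.yaml", "line": 12, "message": "timeout after 5s"}
rerun = dict(first, message="timeout after 7s")
other_rule = dict(first, rule_id="EDGE-005", severity="SHOULD")

print(fingerprint(first) == fingerprint(rerun))       # message is not hashed
print(fingerprint(first) == fingerprint(other_rule))  # rule_id is hashed
print(severity(first), severity(other_rule))
```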

GitLab CI

stages:
  - test

mcptest:
  stage: test
  image: rust:1.85
  script:
    - curl -sSL https://download.mcptest.sh/install.sh | sh
    - mcptest run
        --config tests/mcp.yaml
        --reporter gitlab
        --output gl-code-quality-report.json
  artifacts:
    when: always
    reports:
      codequality: gl-code-quality-report.json
    paths:
      - gl-code-quality-report.json
    expire_in: 1 week
  allow_failure: true

reports.codequality is the well-known artifact GitLab consumes to populate the merge request widget. allow_failure: true lets the job upload the report even when mcptest exits non-zero; remove it once you want a failing run to block the merge.

Comparison matrix and model sweeps

The matrix reporter renders a run as a grid: one row per test case, one column per model (or prompt), and a pass/fail cell at each intersection with the score and a drill-down into the failing rows. It is the file-based equivalent of the side-by-side comparison view, written as a single self-contained HTML file (or a Markdown variant) with no server.

Use it when a run fans an agent test across a models: matrix and you want to see, at a glance, which model wins each test and where the failures cluster.

A matrix grid needs matrix data. A run produces it when an agent test declares a model matrix (models:) so the runner dispatches one cell per model. That matrix is carried on the report, so both the live run and a re-render from a saved report show the same grid:

agents:
  - name: summarize
    server: docs
    prompt: "Summarize the attached document in three bullets."
    models: ["gpt-4o", "claude-3-5-sonnet"]
    expect:
      - matcher:
          factuality:
            reference: "<the document>"

# Render at run time.
mcptest run suite.yml --reporter matrix --output matrix.html

# Or re-render a saved report later (same artifact).
mcptest run suite.yml --reporter json --output run.json
mcptest report run.json --format matrix --output matrix.html

When a report carries no matrix section (an ordinary, non-matrix run) the grid degrades to a single result column with one row per test, so the reporter is always well-defined.

Sweeping models from the CLI

You do not have to edit the YAML to compare models. --models runs the whole suite as a model matrix in one invocation: every agent test fans across the comma-separated list (overriding any models: it declared), one cell per model, and the run defaults to the matrix reporter so the output is the grid.

mcptest run suite.yml --models gpt-4o,claude-3-5-sonnet,gemini-2.0-flash

The provider for a bare id is auto-detected from its prefix (gpt-*, claude-*, gemini-*, ...). Pass an explicit --reporter (and --output) to override the default grid output. The exit code follows the usual rule: the run fails if any cell fails, so a green run means every model passed every agent test. Tool, resource, and prompt tests carry no model dimension, so they run once and are not part of the model grid.
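
Prefix-based detection of this kind can be pictured with a short Python sketch. It is purely illustrative: the provider labels (`openai`, `anthropic`, `google`) and both function names are assumptions made for the example, and the real prefix list is longer than the three shown.

```python
# Hypothetical prefix table; labels are illustrative, not mcptest's own names.
PREFIXES = {
    "gpt-": "openai",
    "claude-": "anthropic",
    "gemini-": "google",
}


def parse_models(arg):
    """Split a comma-separated --models value into a list of model ids."""
    return [m.strip() for m in arg.split(",") if m.strip()]


def detect_provider(model_id):
    """Return the provider label for a bare model id, or None if unknown."""
    for prefix, provider in PREFIXES.items():
        if model_id.startswith(prefix):
            return provider
    return None


models = parse_models("gpt-4o,claude-3-5-sonnet,gemini-2.0-flash")
columns = [(m, detect_provider(m)) for m in models]
print(columns)
```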

matrix vs matrix-md

matrix (a self-contained HTML file): cells are colored (green pass, red fail, gray when a model produced no comparable row), show the score, and expand into a "why" drill-down listing the failing rows and their messages. A summary row gives the per-model pass rate; a summary column gives the per-test pass rate.
matrix-md (GitHub-flavored Markdown): the same grid as a table. Because GFM table cells cannot hold block content, the per-cell drill-downs render as <details> blocks beneath the table under a Failures heading.
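
The summary row and summary column amount to two aggregations over the same cells. The Python sketch below shows that arithmetic on a hypothetical in-memory grid keyed by `(test, model)`; the data structure is an assumption, and so is the choice to leave gray cells (no comparable row, modeled as `None`) out of the denominator.

```python
from collections import defaultdict


def pass_rates(cells):
    """Return (per_model, per_test) pass rates from {(test, model): bool|None}.

    Assumption: None (no comparable row) is excluded from both rates.
    """
    per_model = defaultdict(lambda: [0, 0])  # model -> [passed, counted]
    per_test = defaultdict(lambda: [0, 0])   # test  -> [passed, counted]
    for (test, model), passed in cells.items():
        if passed is None:
            continue
        for bucket in (per_model[model], per_test[test]):
            bucket[0] += int(passed)
            bucket[1] += 1
    # Buckets exist only once a cell was counted, so counted is never zero.
    model_rates = {m: p / n for m, (p, n) in per_model.items()}
    test_rates = {t: p / n for t, (p, n) in per_test.items()}
    return model_rates, test_rates


# Hypothetical grid: two tests across two models.
cells = {
    ("summarize", "gpt-4o"): True,
    ("summarize", "claude-3-5-sonnet"): False,
    ("extract", "gpt-4o"): True,
    ("extract", "claude-3-5-sonnet"): None,  # gray cell
}
model_rates, test_rates = pass_rates(cells)
print(model_rates)
print(test_rates)
```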

Secrets are redacted from labels and drill-downs by the same redactor every other reporter uses.