Reporters
A reporter turns a run into output. mcptest run --reporter <FORMAT> picks the format and --output <PATH> picks the sink (a file, or stdout when omitted). The default is pretty.
One run can be re-rendered into any format afterward. Capture json once, then mcptest report <run.json> --format <fmt> writes a second format from the same saved run without re-executing the suite:
mcptest run --reporter json --output run.json
mcptest report run.json --format junit --output junit.xml
mcptest report run.json --format sarif --output mcptest.sarif
The redaction policy is re-applied at the dispatch site, so secrets sealed by the JSON reporter stay sealed when another format emits from the saved run.
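The capture-once, render-many flow above can be scripted. The following is a minimal sketch that only builds the `mcptest report` command lines, using the `--format` and `--output` flags shown above; the helper name `report_commands` is hypothetical and not part of mcptest.

```python
import shlex


def report_commands(saved_run, outputs):
    """Build one `mcptest report` argv per requested format.

    `outputs` maps a format name to the path it should be written to.
    Only flags that appear in this page are used.
    """
    commands = []
    for fmt, path in outputs.items():
        commands.append(
            ["mcptest", "report", saved_run, "--format", fmt, "--output", path]
        )
    return commands


commands = report_commands(
    "run.json",
    {"junit": "junit.xml", "sarif": "mcptest.sarif"},
)
for argv in commands:
    # Print a copy-pasteable shell line for each re-render.
    print(shlex.join(argv))
```

To execute the commands rather than print them, each argv could be passed to `subprocess.run(argv, check=True)`, assuming mcptest is on the PATH.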
Format table
| Format | Sink | Use it for |
|---|---|---|
| pretty | text (default) | Interactive shells and local development. One line per test plus a summary, failure detail inline. |
| minimal | text | A compact one-line summary (ran N tests: ...) on stdout, a FAIL line per failure on stderr. Terse CI logs. Run-only (not available on mcptest report). |
| json | file/stdout | The full run record. A json file is the canonical artifact mcptest report re-renders any other format from. |
| junit | file | A CI test reporter (GitHub Actions, GitLab, CircleCI Insights). One `<testsuite>` per server, one `<testcase>` per test. |
| md | file/stdout | A GitHub-flavored Markdown summary for a PR comment or job summary. |
| html | file | A single-file HTML report with inline CSS. |
| sarif | file | SARIF 2.1.0 for GitHub code scanning and similar finding consumers. See SARIF below. |
| gitlab | file | GitLab Code Quality JSON for the merge request widget. See GitLab Code Quality below. |
| ndjson | file/stdout | Newline-delimited JSON: one test record per line, then a summary. For log pipelines and jq -c. |
| tap | file/stdout | Test Anything Protocol v14, for prove/tappy-style consumers. |
| matrix | file | A self-contained HTML test-by-model comparison grid. The default output of a --models sweep. See Comparison matrix and model sweeps below. |
| matrix-md | file/stdout | The comparison grid as GitHub-flavored Markdown. |
| agent | text/file | Compact, model-facing output for a coding agent reading the result in a loop. See agent interface. |
| openinference | file/stdout | OpenInference/OTLP-compatible trace spans as JSONL, for eval and observability platforms. See OpenTelemetry exports. |
A pretty, md, or html report can also carry a compliance score delta when the run is gated against a baseline; see compliance score delta.
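For the ndjson format, a log pipeline can split the stream into test records and the trailing summary. The sketch below relies only on the layout described in the table (one record per line, summary last); the field names `name`, `status`, `total`, and `failed` in the sample are assumptions for illustration, not the documented schema.

```python
import json


def split_ndjson(text):
    """Split an ndjson stream into (test_records, summary).

    Blank lines are skipped. The last JSON object is treated as the
    summary, following the "test records, then a summary" layout.
    """
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    if not records:
        return [], None
    return records[:-1], records[-1]


# Hypothetical sample; real field names may differ.
sample = "\n".join(
    [
        json.dumps({"name": "lists tools", "status": "pass"}),
        json.dumps({"name": "rejects bad input", "status": "fail"}),
        json.dumps({"total": 2, "failed": 1}),
    ]
)

tests, summary = split_ndjson(sample)
failed = [t["name"] for t in tests if t.get("status") == "fail"]
print(failed, summary)
```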
SARIF
mcptest emits SARIF, the Static Analysis Results Interchange Format (v2.1.0), so the same run that proves your MCP server works can also surface findings in any tool that already speaks SARIF: GitHub code scanning, GitLab Ultimate, Azure DevOps Advanced Security, Sonar, Semgrep, and the VS Code SARIF Viewer.
Use SARIF when the consumer cares about findings, not test counts. A typical SARIF consumer renders each result inline on a pull request as a code annotation pointing at the YAML line that authored the test, and groups results by rule across runs. That is a better fit for the compliance suite than JUnit, which models results as test cases. If your CI consumer wants test totals (X passed, Y failed), keep JUnit. If it wants a security-style finding stream, use SARIF.
mcptest run --config examples/reference-server/tests/smoke.yml --reporter json --output run.json
mcptest report run.json --format sarif --output mcptest.sarif
Every failing test becomes one result entry:
| SARIF field | Source |
|---|---|
| ruleId | TestResult::rule_id when set (PROTO-001, EDGE-005, ...). Falls back to mcptest.assertion.failed. |
| level | MUST -> error, SHOULD -> warning, MAY -> note. Failure with no rule defaults to error. |
| message.text | `<test name>: <failure message>`. |
| locations[].physicalLocation.artifactLocation.uri | TestResult::file (path to the YAML that authored the test). |
| locations[].physicalLocation.region.startLine | TestResult::line. |
| properties.duration_ms, properties.cache_hit, properties.started_at, properties.ended_at, properties.severity | Per-test telemetry. SARIF consumers ignore unknown properties. |
The run object carries tool.driver.{name, version, informationUri, rules}. Every distinct rule_id seen in the run lands in the rules array with a shortDescription, fullDescription, helpUri (https://mcptest.sh/compliance/<RULE-ID>), and defaultConfiguration.level matching the RFC 2119 severity. Run-level timing lives under runs[0].properties.timing (total_duration_ms, cache_hits, started_at, ended_at).
Passing tests are not emitted as result entries because SARIF models a defect stream. Use the JSON reporter or JUnit reporter when you need the full pass list.
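The field mapping in the table above can be read as a small function. This is an illustrative sketch, not mcptest's implementation: the input is a plain dict whose keys (`name`, `message`, `rule_id`, `severity`, `file`, `line`) are assumed to mirror the TestResult fields named in the table.

```python
# RFC 2119 severity -> SARIF level, as listed in the table.
LEVELS = {"MUST": "error", "SHOULD": "warning", "MAY": "note"}


def to_sarif_result(test):
    """Map one failing test (a dict, shape assumed) to a SARIF result."""
    rule_id = test.get("rule_id")
    # A failure with no rule attached defaults to "error".
    level = LEVELS.get(test.get("severity"), "error") if rule_id else "error"
    return {
        "ruleId": rule_id or "mcptest.assertion.failed",
        "level": level,
        "message": {"text": f"{test['name']}: {test['message']}"},
        "locations": [
            {
                "physicalLocation": {
                    "artifactLocation": {"uri": test["file"]},
                    "region": {"startLine": test["line"]},
                }
            }
        ],
    }


with_rule = to_sarif_result(
    {
        "name": "lists tools",
        "message": "expected 3 tools, got 2",
        "rule_id": "EDGE-005",
        "severity": "SHOULD",
        "file": "tests/mcp.yaml",
        "line": 12,
    }
)
without_rule = to_sarif_result(
    {"name": "echo works", "message": "timeout", "file": "tests/mcp.yaml", "line": 30}
)
print(with_rule["level"], without_rule["ruleId"])
```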
GitHub code scanning
name: mcptest
on:
  pull_request:
  push:
    branches: [main]
jobs:
  mcptest:
    runs-on: ubuntu-latest
    permissions:
      security-events: write # required for upload-sarif
      contents: read
    steps:
      - uses: actions/checkout@v4
      - name: Install mcptest
        run: curl -sSL https://download.mcptest.sh/install.sh | sh
      - name: Run mcptest with SARIF output
        run: |
          mcptest run \
            --config tests/mcp.yaml \
            --reporter sarif \
            --output mcptest.sarif
        continue-on-error: true # let the SARIF upload run even on failure
      - name: Upload SARIF to GitHub code scanning
        if: always()
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: mcptest.sarif
          category: mcptest
GitHub displays each finding on the "Files changed" tab of the pull request, with the ruleId (for example PROTO-001) and a click-through to https://mcptest.sh/compliance/PROTO-001. The Security tab aggregates findings by rule so a regression spike on EDGE-* rules surfaces at a glance.
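The same by-rule aggregation can be reproduced locally from the SARIF file, for example to print a summary in a job log. The sketch below reads only the standard SARIF `runs[].results[].ruleId` path; grouping by the prefix before the first hyphen (PROTO, EDGE) is a convention assumed here for illustration.

```python
from collections import Counter


def findings_by_rule(sarif):
    """Count SARIF results per ruleId across all runs."""
    counts = Counter()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            counts[result.get("ruleId", "unknown")] += 1
    return counts


def findings_by_family(counts):
    """Group rule counts by the prefix before the first hyphen (assumed convention)."""
    families = Counter()
    for rule_id, n in counts.items():
        families[rule_id.split("-", 1)[0]] += n
    return families


# A hand-written stand-in for json.load(open("mcptest.sarif")).
sarif = {
    "version": "2.1.0",
    "runs": [
        {
            "results": [
                {"ruleId": "EDGE-005"},
                {"ruleId": "EDGE-005"},
                {"ruleId": "PROTO-001"},
            ]
        }
    ],
}

by_rule = findings_by_rule(sarif)
by_family = findings_by_family(by_rule)
print(dict(by_rule), dict(by_family))
```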
GitLab Code Quality
mcptest emits a JSON array compatible with GitLab Code Quality so failures surface as inline annotations on the merge request "Changes" tab. The same array can be consumed by any tool that speaks the GitLab Code Quality format (CodeClimate engines, GitLab CI custom widgets).
mcptest run --config examples/reference-server/tests/smoke.yml --reporter json --output run.json
mcptest report run.json --format gitlab --output gl-code-quality-report.json
Every failing test becomes one finding. Skipped tests render as minor findings, so reviewers still see them without flipping the pipeline status. Passing tests are omitted because the format models a defect stream.
| Field | Source |
|---|---|
| description | `<test name>: <failure message>`. |
| check_name | TestResult::rule_id when set (PROTO-001, EDGE-005, ...). Falls back to mcptest.assertion. |
| fingerprint | SHA-256 hex of rule_id + file + line (or name + file + line when no rule is attached). Excludes the failure message and test name so cosmetic edits do not churn fingerprints across runs. |
| severity | MUST -> critical, SHOULD -> major, MAY -> minor. Test failure with no rule attached -> major. Skipped tests -> minor. |
| location.path | TestResult::file. |
| location.lines.begin | TestResult::line. |
| properties.duration_ms, properties.cache_hit | Per-test telemetry. GitLab ignores unknown keys. |
GitLab orders severities info < minor < major < critical < blocker. Mapping RFC 2119 severity directly to that ladder keeps the merge request badge meaningful: a MUST regression is critical (block the merge), a SHOULD regression is major (review required), and a MAY regression is minor (informational). When the test is not a compliance check, the safe default is major so reviewers still see the finding.
Fingerprints stay stable across reruns of the same test even when the failure message changes. Two distinct rules at the same location still produce different fingerprints.
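The fingerprint and severity rules above can be sketched as follows. The page states which inputs are hashed but not how they are joined, so the NUL separator here is an assumption, and the resulting hex strings will not necessarily match the ones mcptest emits. The function names are hypothetical.

```python
import hashlib

# RFC 2119 severity -> GitLab Code Quality severity, as listed in the table.
SEVERITY = {"MUST": "critical", "SHOULD": "major", "MAY": "minor"}


def fingerprint(file, line, rule_id=None, name=None):
    """SHA-256 hex over rule_id + file + line, or name + file + line without a rule.

    The failure message is deliberately not an input, so rewording it
    cannot change the fingerprint.
    """
    key = rule_id if rule_id is not None else name
    if key is None:
        raise ValueError("either rule_id or name is required")
    # Assumption: fields joined with a NUL byte; the real encoding is not documented here.
    payload = "\x00".join([key, file, str(line)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def gitlab_severity(rfc_level=None, skipped=False):
    """Skipped -> minor; known RFC 2119 level -> mapped; anything else -> major."""
    if skipped:
        return "minor"
    return SEVERITY.get(rfc_level, "major")


proto = fingerprint("tests/mcp.yaml", 12, rule_id="PROTO-001")
edge = fingerprint("tests/mcp.yaml", 12, rule_id="EDGE-005")
print(proto != edge, gitlab_severity("MUST"), gitlab_severity())
```

Two rules at the same file and line hash to different values because the rule id is part of the payload, which matches the behavior described above.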
GitLab CI
stages:
  - test
mcptest:
  stage: test
  image: rust:1.85
  script:
    - curl -sSL https://download.mcptest.sh/install.sh | sh
    - mcptest run
      --config tests/mcp.yaml
      --reporter gitlab
      --output gl-code-quality-report.json
  artifacts:
    when: always
    reports:
      codequality: gl-code-quality-report.json
    paths:
      - gl-code-quality-report.json
    expire_in: 1 week
  allow_failure: true
reports.codequality is the well-known artifact GitLab consumes to populate the merge request widget. allow_failure: true lets the job upload the report even when mcptest exits non-zero; remove it once you want a failing run to block the merge.
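Before uploading, a job can sanity-check the report locally. The sketch below checks the fields listed in the table above and the GitLab severity ladder; it is not an official validator, and treating a repeated fingerprint as a problem is this sketch's own choice.

```python
SEVERITIES = ("info", "minor", "major", "critical", "blocker")


def validate_report(findings):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    seen = set()
    for i, finding in enumerate(findings):
        for key in ("description", "check_name", "fingerprint", "severity"):
            if not finding.get(key):
                problems.append(f"finding {i}: missing {key}")
        severity = finding.get("severity")
        if severity and severity not in SEVERITIES:
            problems.append(f"finding {i}: unknown severity {severity!r}")
        location = finding.get("location", {})
        if not location.get("path"):
            problems.append(f"finding {i}: missing location.path")
        if not isinstance(location.get("lines", {}).get("begin"), int):
            problems.append(f"finding {i}: location.lines.begin must be an integer")
        fp = finding.get("fingerprint")
        if fp is not None:
            if fp in seen:
                problems.append(f"finding {i}: duplicate fingerprint")
            seen.add(fp)
    return problems


good = {
    "description": "lists tools: expected 3 tools, got 2",
    "check_name": "PROTO-001",
    "fingerprint": "a" * 64,
    "severity": "critical",
    "location": {"path": "tests/mcp.yaml", "lines": {"begin": 12}},
}
# Same fingerprint, invalid severity, and no begin line.
bad = dict(good, severity="fatal", location={"path": "tests/mcp.yaml", "lines": {}})

for problem in validate_report([good, bad]):
    print(problem)
```

In a pipeline, the list passed to `validate_report` would come from `json.load` on gl-code-quality-report.json.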
Comparison matrix and model sweeps
The matrix reporter renders a run as a grid: one row per test case, one column per model (or prompt), and a pass/fail cell at each intersection with the score and a drill-down into the failing rows. It is the file-based equivalent of the side-by-side comparison view, written as a single self-contained HTML file (or a Markdown variant) with no server.
Use it when a run fans an agent test across a models: matrix and you want to see, at a glance, which model wins each test and where the failures cluster.
A matrix grid needs matrix data. A run produces it when an agent test declares a model matrix (models:) so the runner dispatches one cell per model. That matrix is carried on the report, so both the live run and a re-render from a saved report show the same grid:
agents:
  - name: summarize
    server: docs
    prompt: "Summarize the attached document in three bullets."
    models: ["gpt-4o", "claude-3-5-sonnet"]
    expect:
      - matcher:
          factuality:
            reference: "<the document>"
# Render at run time.
mcptest run suite.yml --reporter matrix --output matrix.html
# Or re-render a saved report later (same artifact).
mcptest run suite.yml --reporter json --output run.json
mcptest report run.json --format matrix --output matrix.html
When a report carries no matrix section (an ordinary, non-matrix run) the grid degrades to a single result column with one row per test, so the reporter is always well-defined.
Sweeping models from the CLI
You do not have to edit the YAML to compare models. --models runs the whole suite as a model matrix in one invocation: every agent test fans across the comma-separated list (overriding any models: it declared), one cell per model, and the run defaults to the matrix reporter so the output is the grid.
mcptest run suite.yml --models gpt-4o,claude-3-5-sonnet,gemini-2.0-flash
Bare ids are provider-auto-detected from their prefix (gpt-*, claude-*, gemini-*, ...). Pass an explicit --reporter (and --output) to override the default grid output. The exit code follows the usual rule: the run fails if any cell fails, so a green run means every model passed every agent test. Tool, resource, and prompt tests carry no model dimension, so they run once and are not part of the model grid.
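The sweep semantics described above (comma-separated ids, prefix-based provider detection, one cell per agent test and model, failure if any cell fails) can be modeled in a few lines. This is a conceptual sketch, not mcptest's code: the provider labels in `PREFIXES` and all function names are assumptions.

```python
# Prefix -> provider label. The labels are hypothetical; only the prefixes
# (gpt-*, claude-*, gemini-*) come from the text above.
PREFIXES = {"gpt-": "openai", "claude-": "anthropic", "gemini-": "google"}


def parse_models(value):
    """Split the comma-separated --models value into model ids."""
    return [m.strip() for m in value.split(",") if m.strip()]


def detect_provider(model_id):
    """Return the provider label for a bare id, or None if no prefix matches."""
    for prefix, provider in PREFIXES.items():
        if model_id.startswith(prefix):
            return provider
    return None


def plan_cells(agent_tests, models):
    """One cell per (agent test, model); other test kinds are not fanned out."""
    return [(test, model) for test in agent_tests for model in models]


def exit_code(cell_results):
    """0 only when every cell passed, mirroring the exit rule above."""
    return 0 if all(cell_results.values()) else 1


models = parse_models("gpt-4o,claude-3-5-sonnet,gemini-2.0-flash")
cells = plan_cells(["summarize"], models)
results = {cell: True for cell in cells}
results[("summarize", "gemini-2.0-flash")] = False  # one failing cell
print(len(cells), exit_code(results))
```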
matrix vs matrix-md
- `matrix` (a self-contained HTML file): cells are colored (green pass, red fail, gray when a model produced no comparable row), show the score, and expand a `why` drill-down listing the failing rows and their messages. A summary row gives the per-model pass rate; a summary column gives the per-test pass rate.
- `matrix-md` (GitHub-flavored Markdown): the same grid as a table. Because GFM table cells cannot hold block content, the per-cell drill-downs render as `<details>` blocks beneath the table under a `Failures` heading.
Secrets are redacted from labels and drill-downs by the same redactor every other reporter uses.
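The summary row and column described above are simple ratios over the grid. The sketch below computes them from a dict of cells; the cell representation, the hypothetical `extract` test, and the choice to leave cells with no comparable row (`None`) out of the denominator are all assumptions made for illustration.

```python
def rate(values):
    """Pass rate over scored cells; None cells (no comparable row) are skipped."""
    scored = [v for v in values if v is not None]
    return sum(scored) / len(scored) if scored else None


def summarize(cells):
    """cells maps (test, model) -> True, False, or None.

    Returns (per_model, per_test) pass rates: the summary row and the
    summary column of the grid.
    """
    tests = sorted({test for test, _ in cells})
    models = sorted({model for _, model in cells})
    per_model = {m: rate([cells.get((t, m)) for t in tests]) for m in models}
    per_test = {t: rate([cells.get((t, m)) for m in models]) for t in tests}
    return per_model, per_test


grid = {
    ("summarize", "gpt-4o"): True,
    ("summarize", "claude-3-5-sonnet"): True,
    ("extract", "gpt-4o"): False,
    ("extract", "claude-3-5-sonnet"): True,
}

per_model, per_test = summarize(grid)
print(per_model, per_test)
```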