CLI reference
Complete reference for every subcommand and global flag in the mcptest binary. Source of truth is crates/mcptest/src/cli/ (the Command enum in cli/mod.rs and the per-subcommand Args structs under cli/args/); this page mirrors that at the v1.0 cut. When a flag is wired to a stub handler, that is called out so a reader knows the implementation work is still pending.
For a friendlier introduction, start with getting-started.md. The YAML test format is documented in yaml-reference.md, and common failure modes are covered in troubleshooting.md.
Synopsis
mcptest [GLOBAL_OPTIONS] <SUBCOMMAND> [ARGS]
mcptest --help prints a clap-generated summary of every flag below. mcptest --version prints the build version. mcptest <SUBCOMMAND> --help prints the per-subcommand summary including any subcommand-specific flags.
Global flags are accepted before or after the subcommand name: both mcptest --debug run and mcptest run --debug parse identically.
Global options
These flags are declared on GlobalArgs in crates/mcptest/src/cli/global.rs. Every subcommand inherits them via #[command(flatten)], so they work uniformly regardless of which subcommand you call.
Output and logging
--no-color
- Type: boolean flag
- Default: off
- Description: disable ANSI color output. Useful in CI logs and when piping to tools that mishandle escape sequences.
- When to use it: any CI provider that captures raw stdout (GitHub Actions, GitLab CI, Buildkite) renders mcptest output more cleanly without color. Also useful when redirecting output to a file you plan to read later in a pager that does not handle ANSI.
--debug
- Type: boolean flag
- Default: off
- Description: enable debug logging. Sets RUST_LOG=mcptest=debug if the variable is not already set, then initializes the tracing subscriber.
- When to use it: you suspect mcptest is doing something different from what you asked. Debug logs include resolver decisions, HTTP request and response headers (redacted), and matcher dispatch.
--verbose
- Type: boolean flag
- Default: off
- Description: print extra detail about variable resolution and discovered files. Lighter than --debug; aimed at humans rather than developers.
- When to use it: you want to know which .env file shaped the run, or which --var overrides ended up winning. The output is sorted and stable so it diffs cleanly across runs.
Logging
mcptest emits structured log events through tracing once the binary starts. The subscriber writes to stderr (stdout is reserved for reporter output) and is filtered by an EnvFilter resolved from the four sources below.
--log-level <LEVEL>
- Type: string. Either a single level (off, error, warn, info, debug, trace) or a RUST_LOG-style directive like mcptest_core::cache=debug,mcptest_core::runner=trace.
- Default: warn (so a passing run emits no stderr).
- When to use it: you want to dial verbosity for one run without exporting an env var. Invalid directives are rejected before the run starts so a typo never silently falls back to the default.
# Trace cache decisions for one debugging session.
mcptest --log-level "mcptest_core::cache=debug" run
# Quietest possible run; exit code is the only signal.
mcptest --log-level off run
Filter resolution precedence
Highest first:
1. --log-level <VAL>
2. RUST_LOG env var (standard convention).
3. MCPTEST_LOG env var. Use this when a parent process sets RUST_LOG=trace for its own purposes and you do not want that flood to leak into mcptest output.
4. --debug (back-compat: maps to mcptest=debug,mcptest_core=debug,mcptest_config=debug).
5. --verbose (back-compat: maps to mcptest=info,mcptest_core=info,mcptest_config=info).
6. Built-in default: warn.
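
The precedence list above can be read as a first-match lookup. The following is a minimal Python sketch of that reading, not the actual Rust implementation; the function name `resolve_log_filter` and its parameters are hypothetical, and treating an empty string as "unset" is an assumption.

```python
def resolve_log_filter(log_level=None, env=None, debug=False, verbose=False):
    """Return the first filter directive found, highest precedence first."""
    env = env or {}
    candidates = [
        log_level,                # --log-level <VAL>
        env.get("RUST_LOG"),      # standard convention
        env.get("MCPTEST_LOG"),
        "mcptest=debug,mcptest_core=debug,mcptest_config=debug" if debug else None,
        "mcptest=info,mcptest_core=info,mcptest_config=info" if verbose else None,
    ]
    for candidate in candidates:
        if candidate:  # assumption: an empty value counts as unset
            return candidate
    return "warn"  # built-in default
```

Under this reading, `resolve_log_filter(log_level="off", debug=True)` yields `off`, because the explicit flag is checked before the back-compat mapping.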
Other logging knobs
- NO_COLOR=1 or --no-color disables ANSI escapes in log output.
- MCPTEST_LOG_NO_TIME=1 suppresses the RFC 3339 timestamp at the start of every log line. Useful when snapshotting stderr in tests.
What gets logged
- mcptest_core::runner: run start/end, parallelism, per-test status and duration, fail-fast trips.
- mcptest_core::executor: action dispatch, deferred matcher hits.
- mcptest_core::connector: connect start/success per server, negotiated protocol version, shutdown errors.
- mcptest_core::protocol: request method and id; timeouts.
- mcptest_core::transport::stdio: child argv, pid, malformed JSON lines, child stderr.
- mcptest_core::transport::streamable_http: connect URL, non-2xx responses.
- mcptest_core::cache::store::fs: put/get/delete and eviction counts.
- mcptest_core::cache::eligibility: per-test cache decisions.
- mcptest_config::loader: YAML load + validate, schema-skip warnings.
Credentials are never logged. The connector and transport instrumentation redacts Authorization and any header name matching (?i)(token|secret|password|cookie|auth) to ***.
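
To make the redaction rule concrete, here is a small Python sketch that applies the same pattern to a header map. It is illustrative only: `redact_headers` is a hypothetical name, and the sketch assumes "matching" means an unanchored search anywhere in the header name.

```python
import re

# Pattern quoted from the rule above; unanchored search is an assumption.
SENSITIVE_NAME = re.compile(r"(?i)(token|secret|password|cookie|auth)")

def redact_headers(headers):
    """Replace the value of Authorization and any sensitive-looking header with ***."""
    return {
        name: "***"
        if name.lower() == "authorization" or SENSITIVE_NAME.search(name)
        else value
        for name, value in headers.items()
    }
```

Header names such as X-Api-Token or Set-Cookie are masked under this reading, while a header like Accept passes through unchanged.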
--quiet
- Type: boolean flag
- Default: off
- Description: suppress progress output (banners, per-second elapsed lines). Exit codes still communicate success and failure.
- When to use it: CI logs where the only signal that matters is the final status line. Combine with --reporter junit so machine-readable output still lands in the JUnit file.
--reporter <FORMAT>
- Type: enum {pretty, minimal, json, junit, md, html, sarif, gitlab, ndjson, tap, matrix, matrix-md, quiet}
- Default: pretty
- Description: output format for mcptest run. Pairs with --output to pick the sink. The same formats (except minimal) are available on mcptest report --format.
- When to use it:
  - pretty: interactive shells and local development. Prints one line per test plus a summary, with failure detail inline.
  - minimal: a compact one-line summary (ran N tests: ...) on stdout, with a FAIL line per failure on stderr. The legacy default; handy for terse CI logs.
  - json: tooling that consumes the full run record. A json file is the canonical record mcptest report re-renders.
  - junit: a CI test reporter (GitHub Actions, GitLab, CircleCI Insights).
  - md: a Markdown summary for a PR comment or job summary.
  - html: a self-contained HTML build artifact.
  - sarif: GitHub Code Scanning (see sarif-reporter.md).
  - gitlab: GitLab Code Quality (see gitlab-code-quality.md).
  - ndjson: one JSON record per line, for log pipelines and jq -c.
  - tap: Test Anything Protocol v14, for prove/tappy-style consumers.
  - matrix: a self-contained HTML test-by-model comparison grid (see matrix-reporter.md). The default output of a --models sweep.
  - matrix-md: the same comparison grid as GitHub-flavored Markdown.
  - quiet: only the exit code matters. Equivalent to --quiet --reporter pretty but more explicit.
--output <PATH>
- Type: filesystem path. Use - for stdout.
- Default: none (the chosen format renders to stdout)
- Description: sink for the --reporter format. A path writes to that file; a file write also echoes a one-line summary to stderr so the result stays visible. - is an explicit stdout.
- When to use it: you want the JUnit XML in a known location, or you want to capture the JSON run record so mcptest report can re-render it later without a second run (--reporter json --output run.json).
--annotations <WHEN>
- Type: enum {auto, always, never}
- Default: auto
- Description: emit GitHub Actions inline annotations (::error / ::warning) to stderr alongside the normal output, one per failure. auto emits only inside Actions (GITHUB_ACTIONS=true); always forces them; never disables them. Composes on top of any --reporter format.
--color <WHEN>
- Type: enum {auto, always, never}
- Default: auto
- Description: colorize human-readable output (the pretty run/report summary and the mcptest security findings). auto colors only when the output is a terminal and the NO_COLOR environment variable (no-color.org) is unset; always forces color even when piped; never disables it. A file sink (--output) and machine formats (JSON, SARIF, JUnit, NDJSON, ...) stay plain regardless.
Configuration sources
--config <PATH>
- Type: filesystem path
- Default: ./mcptest.yaml if present
- Description: path to the mcptest YAML config. The file is parsed and validated against the JSON Schema before any test runs.
- When to use it: your test suite lives outside the repo root (a subdirectory, a separate config repo, a generated file in /tmp). The path may be absolute or relative to the current working directory.
--env-file <PATH>
- Type: filesystem path
- Default: none
- Description: environment file to load before running. Lines look like KEY=VALUE; quoted values, comments, and blank lines are tolerated. Loaded in addition to any .env, .env.local, or .env.test files discovered in the working directory.
- When to use it: secrets live in a file named something other than .env (a .env.staging, a secrets.env from a vault dump). Repeatable behavior is not supported in v1.0; pass one explicit env file at a time.
--no-env-file
- Type: boolean flag
- Default: off
- Description: skip auto-discovery of .env, .env.local, and .env.test.
- When to use it: CI where the runner injects every variable through real environment variables and a stray .env checked into the repo should not change behavior. Also useful when reproducing a failing CI run locally and you want to make sure your local .env is not silently winning.
--var KEY=VALUE
- Type: repeatable key-value pair
- Default: empty
- Description: override a variable. The key must be non-empty; the value may be empty (so --var FOO= clears an inherited value).
- When to use it: a one-off run with a different model name, base URL, or feature flag. --var has the highest precedence: it beats --env-file, auto-discovered dotenvs, and the process environment.
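
The KEY=VALUE rule for --var can be sketched in a few lines of Python. This is an illustration rather than the real parser: `parse_var` and `apply_overrides` are hypothetical names, and splitting on the first `=` (so values may themselves contain `=`) is an assumption.

```python
def parse_var(arg):
    """Split a --var argument into (key, value); the value may be empty."""
    key, sep, value = arg.partition("=")  # assumption: split on the first "="
    if not sep or not key:
        raise ValueError(f"expected KEY=VALUE with a non-empty key, got {arg!r}")
    return key, value

def apply_overrides(resolved, var_args):
    """Layer --var overrides on top of variables resolved from lower-precedence sources."""
    merged = dict(resolved)
    for arg in var_args:
        key, value = parse_var(arg)
        merged[key] = value
    return merged
```

With this sketch, `apply_overrides({"FOO": "bar"}, ["FOO="])` leaves FOO present but empty, which is the "clears an inherited value" case.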
--show-secrets
- Type: boolean flag
- Default: off (values render as ***)
- Description: print resolved variable values verbatim instead of redacting them.
- When to use it: only when you are debugging on your own machine, not in a shared terminal, and not in CI. Anything the resolver sees becomes visible: bearer tokens, API keys, internal URLs. Treat this like set -x for secrets.
Test selection and execution
--filter <EXPR>
- Type: string expression
- Default: none (run every test)
- Description: filter expression. Only tests whose name or tag matches the expression run. Substring match in v1.0; richer query syntax is planned.
- When to use it: you are iterating on a single test or a tag like @smoke and do not want to wait for the full suite.
--parallel <N>
- Type: integer (0 means auto)
- Default: auto (typically the CPU core count)
- Description: maximum number of tests to run in parallel. 0 defers to the runner; any positive integer pins the cap.
- When to use it: your MCP server has rate limits and you want to keep concurrency low, or you are debugging a flaky test and want --parallel 1 to remove interleaving.
--timeout <SECONDS>
- Type: integer seconds
- Default: value from the YAML config, or the runner default
- Description: per-test timeout, in seconds. Overrides any value in the config. Accepts whole numbers only.
- When to use it: a particular run has a slow target (a cold-started Lambda, a remote dev machine), or you want a strict timeout in CI to avoid runaway jobs.
--retry <N>
- Type: integer
- Default: 0
- Description: retry each failing test up to N times before counting it as failed. Retries do not paper over real bugs; they smooth out network blips against flaky third-party services.
- When to use it: flake-prone integrations that you cannot improve directly. Always file a follow-up to fix the underlying flake; --retry is a CI patch, not a fix.
--watch
- Type: boolean flag
- Default: off
- Description: re-run on file changes (watch mode). Stub in v1.0; live wiring is still pending.
- When to use it: interactive development. Until the wiring lands the flag parses but does not yet block on file events.
--wait-for-ready[=DURATION]
- Type: optional duration with units. Accepts a bare integer (seconds), or Ns, Nm, Nh. Examples: 30, 30s, 2m, 1h.
- Default: not set. When passed without =..., defaults to 60s.
- Description: before the run connects, poll every URL server until it accepts a TCP connection, or fail after the configured budget. The driver uses exponential backoff (250ms, 500ms, 1s, 2s, then 5s steady). On mcptest run the budget is shared across all URL servers; on mcptest doctor it polls the --url target. A server that never comes up exits 3; stdio servers are spawned by mcptest, so they are skipped (nothing to wait for). Once the listener is up the normal connect proceeds, so a 401 or protocol error surfaces through the run's own fast-fail rather than the readiness loop.
- When to use it: preview-deploy CI where the target spins up alongside the test job and is not immediately reachable. --wait-for-ready=60s is a sane default; bump it for cold-started containers.
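
The duration grammar and the backoff schedule described for --wait-for-ready can be modelled as follows. This is a rough Python sketch under stated assumptions, not the driver itself: the function names are hypothetical, and how the real loop treats a final sleep that would overshoot the budget is not documented here, so the sketch simply stops before it.

```python
import itertools
import re

UNIT_SECONDS = {"": 1, "s": 1, "m": 60, "h": 3600}

def parse_duration(text="60s"):
    """Parse 30, 30s, 2m, or 1h into seconds; the default mirrors the bare flag."""
    match = re.fullmatch(r"([0-9]+)([smh]?)", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]

def backoff_delays():
    """Yield the documented delays: 250ms, 500ms, 1s, 2s, then 5s forever."""
    yield from (0.25, 0.5, 1.0, 2.0)
    yield from itertools.repeat(5.0)

def delays_within(budget_seconds):
    """List the sleeps that fit inside the budget (assumption: no partial sleep)."""
    spent, plan = 0.0, []
    for delay in backoff_delays():
        if spent + delay > budget_seconds:
            break
        plan.append(delay)
        spent += delay
    return plan
```

For example, `delays_within(parse_duration("2m"))` shows how many polls a two-minute budget allows under these assumptions.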
Server target overrides
These flags let you change the server: block in the YAML suite at run time. They are useful for preview deploys and CI matrices where the YAML is authored without knowing the target URL.
--server-url <URL>
- Type: URL string
- Default: none (use the in-suite server: block)
- Description: override server.url for every server in the YAML suite. Mutually exclusive with --server-command. Wins over the same field in --server-config.
- When to use it: preview deploy on a PR-specific URL, or running the same YAML against staging and production in two consecutive CI steps.
--server-command <CMD>
- Type: shell-quoted command string
- Default: none
- Description: override server.command for every server in the YAML suite. The argument is split using POSIX shell rules (the shell-words crate), so --server-command "./dev-server --debug" parses to ["./dev-server", "--debug"]. Mutually exclusive with --server-url. Wins over the same field in --server-config.
- When to use it: you want to point a YAML written for a remote URL at a locally-built binary instead, without editing the YAML.
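
To preview how a --server-command string will be split, Python's standard `shlex` module follows similar POSIX shell rules. It is used here only as a stand-in for the shell-words crate; edge cases may differ between the two, and `split_server_command` is a hypothetical helper name.

```python
import shlex

def split_server_command(command):
    """Split a command string into argv using POSIX shell quoting rules."""
    argv = shlex.split(command)
    if not argv:
        raise ValueError("server command is empty")
    return argv

print(split_server_command("./dev-server --debug"))
```

Quoted arguments stay together, so a value such as `--name "my server"` becomes two argv entries rather than three.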
--server-auth-bearer-env <NAME>
- Type: env variable name
- Default: none
- Description: set server.auth.bearer_token_env for URL targets. The runner reads NAME from the environment at connect time and sends the value as Authorization: Bearer <value>. Wins over the same field in --server-config.
- When to use it: the YAML hard-codes the bearer env name for production (PROD_BEARER_TOKEN) and a preview environment uses a different secret name (PREVIEW_BEARER_TOKEN).
--server-config <PATH>
- Type: filesystem path
- Default: none
- Description: load a YAML file containing a full server: block and use it in place of the in-suite server block. Lower precedence than the single-field flags above: when a field appears in both --server-config and a flag like --server-url, the explicit flag wins.
- When to use it: you maintain per-environment server snippets (server.prod.yaml, server.staging.yaml) and want to swap them in without editing the main suite.
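
The layering of the server-target flags can be summarised in a short Python sketch. It models a single server block as a plain dict, which is a simplification: the real flags apply to every server in the suite, and `resolve_server` and its parameter names are hypothetical.

```python
import copy

def resolve_server(in_suite, server_config=None, url=None, command=None, bearer_env=None):
    """Apply --server-config first, then let single-field flags win."""
    if url is not None and command is not None:
        raise ValueError("--server-url and --server-command are mutually exclusive")
    # --server-config replaces the in-suite block wholesale.
    base = server_config if server_config is not None else in_suite
    server = copy.deepcopy(base)
    # Explicit single-field flags override the same field from either source.
    if url is not None:
        server["url"] = url
    if command is not None:
        server["command"] = command
    if bearer_env is not None:
        server.setdefault("auth", {})["bearer_token_env"] = bearer_env
    return server
```

The deep copy keeps the original suite untouched, so the same YAML can be resolved against several targets in one process.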
HTTP transport
--header NAME=VALUE
- Type: repeatable key-value pair
- Default: empty
- Description: add a literal HTTP header. Authorization and Proxy-Authorization are rejected; credentials must live in auth:. The value is sent verbatim.
- When to use it: a custom header expected by the server (X-Tenant-ID, X-Request-Source). Combine with --header-env when the value should come from an env var.
--header-env NAME=VAR_NAME
- Type: repeatable header-name to env-var-name pair
- Default: empty
- Description: add an env-backed HTTP header. The runner reads VAR_NAME from the environment at connect time and uses its value as the header value. Both names must be non-empty.
- When to use it: secret-flavored custom headers (multi-tenant SaaS keys, internal trace tokens) you do not want to put on the command line.
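
The validation rules for --header and --header-env can be sketched together. This Python example is an approximation with hypothetical names (`build_headers`, `parse_pair`); comparing the rejected header names case-insensitively, and raising when the referenced env var is missing, are both assumptions.

```python
FORBIDDEN = {"authorization", "proxy-authorization"}

def parse_pair(arg):
    """Split NAME=REST on the first "="; NAME must be non-empty."""
    name, sep, rest = arg.partition("=")
    if not sep or not name:
        raise ValueError(f"expected NAME=VALUE, got {arg!r}")
    return name, rest

def build_headers(header_args, header_env_args, env):
    headers = {}
    for arg in header_args:
        name, value = parse_pair(arg)
        if name.lower() in FORBIDDEN:  # assumption: case-insensitive check
            raise ValueError(f"{name} must be configured under auth:, not --header")
        headers[name] = value  # sent verbatim
    for arg in header_env_args:
        name, var_name = parse_pair(arg)
        if not var_name:
            raise ValueError("--header-env needs a non-empty env var name")
        headers[name] = env[var_name]  # assumption: a missing variable is an error
    return headers
```

Passing the environment in as a dict keeps the sketch testable; a real caller would hand over `os.environ`.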
--insecure-skip-verify
- Type: boolean flag
- Default: off
- Description: disable TLS certificate verification. Dangerous; only use against a private staging endpoint with a self-signed certificate. The runner prints a banner whenever this flag is set so the operator can spot a misconfigured CI job.
- When to use it: local staging environments behind a self-signed cert. Do not use against production endpoints, ever.
--ca-bundle <PATH>
- Type: filesystem path to a PEM-encoded CA bundle
- Default: system trust store
- Description: path to a PEM-encoded CA bundle for HTTP transport.
- When to use it: your organization runs an internal CA whose root is not in the system trust store. Preferable to --insecure-skip-verify.
--http-timeout <SECONDS>
- Type: integer seconds
- Default: value from server.http.timeout, or the runner default
- Description: HTTP per-request timeout, in seconds. Overrides server.http.timeout.
- When to use it: a slow upstream needs longer per-request budgets, or you want a strict bound for CI fairness.
--connect-timeout <SECONDS>
- Type: integer seconds
- Default: value from server.http.connect_timeout, or the runner default
- Description: HTTP connect timeout, in seconds. Overrides server.http.connect_timeout.
- When to use it: behind a flaky NAT or VPN; a tighter connect budget will trip faster than waiting for the per-request timeout.
Proxy
Proxy flags apply to every outbound HTTP client mcptest builds: the StreamableHTTP and legacy SSE transports for MCP servers, plus the LLM provider clients (Anthropic, OpenAI, Google, Mistral, and any custom OpenAI-compat provider declared under providers:).
When no flag is set, reqwest reads HTTP_PROXY, HTTPS_PROXY, and NO_PROXY from the environment, so users behind a corporate proxy who already export those variables get the right behavior without changing anything. Use the flags below to override or disable.
--proxy <URL>
- Type: URL string
- Default: unset; reqwest reads HTTP_PROXY / HTTPS_PROXY from the environment
- Description: catch-all proxy for both HTTP and HTTPS targets.
- When to use it: a corporate proxy that handles both schemes.
--http-proxy <URL>
- Type: URL string
- Default: unset
- Description: HTTP-only proxy. Wins over --proxy for plain-HTTP targets.
- When to use it: separate proxies per scheme.
--https-proxy <URL>
- Type: URL string
- Default: unset
- Description: HTTPS-only proxy. Wins over --proxy for HTTPS targets.
- When to use it: most corporate setups route only HTTPS through a TLS-terminating CONNECT proxy.
--no-proxy
- Type: boolean flag
- Default: off
- Description: disable every proxy, including reqwest's automatic env-var pickup. Mutually exclusive with --proxy, --http-proxy, and --https-proxy.
- When to use it: the shell has a system-wide HTTPS_PROXY set but this single run should go direct.
--noproxy <HOSTLIST>
- Type: comma-separated hostnames or domain patterns
- Default: empty
- Description: hosts to bypass the proxy for. Each entry is an exact hostname or a leading-dot suffix (.internal.example). Mirrors NO_PROXY semantics.
- When to use it: route everything through the corporate proxy except your localhost test server or internal staging domain.
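
The two entry shapes accepted by --noproxy can be illustrated with a small matcher. This Python sketch is an assumption-laden model, not the reqwest behavior: the function names are hypothetical, matching is assumed case-insensitive, and a leading-dot entry is assumed to match subdomains only.

```python
def parse_noproxy(hostlist):
    """Split a comma-separated host list, dropping blanks and normalising case."""
    return [entry.strip().lower() for entry in hostlist.split(",") if entry.strip()]

def bypasses_proxy(host, entries):
    """Return True when host matches an exact entry or a leading-dot suffix entry."""
    host = host.lower()
    for entry in entries:
        if entry.startswith("."):
            if host.endswith(entry):  # suffix form, e.g. .internal.example
                return True
        elif host == entry:           # exact hostname
            return True
    return False
```

With `parse_noproxy("localhost,.internal.example")`, both `localhost` and `api.internal.example` go direct in this model while `example.com` still uses the proxy.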
Verify what is in effect with mcptest doctor (prints a one-line proxy: summary) or mcptest run --print-config (includes the same summary plus the resolved test list).
Upload reporter
These flags are read by mcptest report --format upload. They parse on every command but have no effect outside that path.
--upload-endpoint <URL>
- Type: URL string
- Default: none
- Description: endpoint to POST the canonical run envelope to. The body shape is documented in schemas/wire/v0.json.
- When to use it: shipping run results to a collector (an mcptest-cloud instance, an internal observability service).
--upload-token-env <NAME>
- Type: env variable name
- Default: none
- Description: name of the env var that holds the bearer token for the upload reporter. Read at config-build time and sent verbatim as Authorization: Bearer <value>.
- When to use it: the collector requires authentication. Keep the token out of the command line and out of CI logs by referencing it through an env var.
--upload-organization <NAME>
- Type: string
- Default: none
- Description: optional organization tag included in the upload envelope.
- When to use it: multi-tenant collectors that need to attribute runs to a specific organization. The collector treats this as an opaque label.
Subcommands
run
mcptest [GLOBAL_OPTIONS] run
Description. Run the test suite. This is the primary command and the one you will type most often.
The runner loads the YAML config (default ./mcptest.yaml or whatever --config points to), resolves variables, applies any server-target overrides, and executes each test against the configured server. Results are printed by the reporter selected with --reporter (default pretty).
Arguments.
| Argument | Type | Required | Description |
|---|---|---|---|
| --update-snapshots, -u | flag | optional | Rewrite every snapshot fixture encountered during the run. |
| --allow-update-in-ci | flag | optional | Permit --update-snapshots even when CI=true is set. |
| --models <ID,ID,...> | list | optional | Run the suite as a model matrix: every agent test fans across this comma-separated list (one cell per model), and the run defaults to the matrix reporter. See matrix-reporter.md. |
| --no-verdict-cache | flag | optional | Disable the LLM-judge verdict cache for this run. Overrides evals.cache.verdicts: true in YAML. |
| --coverage | flag | optional | Record per-tool, per-argument, per-error-path, and per-transport coverage during the run. Folds the result into JSON, pretty, markdown, and HTML reporters. |
| --coverage-threshold <SPEC> | string | optional | Quality gate against the coverage report. Accepts tools=80,args=60,error_paths=50,transports=100. Exits with code 6 when any dimension falls below its threshold. Requires --coverage. |
| --no-cache | flag | optional | Bypass the content-addressed cache for this run. Equivalent to passing both --no-cache-read and --no-cache-write. |
| --no-cache-read | flag | optional | Write fresh entries but ignore existing ones. Equivalent to "refresh the cache on this run." |
| --no-cache-write | flag | optional | Read existing entries but do not update the cache. Equivalent to "freeze the cache for this run." |
| --cache-filter <SET> | string | optional | Restrict execution to tests matching the named cache set. NEW is the only value v1 ships. |
| --record | flag | optional | For agent tests, dispatch every model live and overwrite the cassette on disk. Default behavior replays the cassette when present. |
| --bail, -x | flag | optional | Stop the runner after the first failing test. Subsequent tests are reported as skipped. |
| --maxfail <N> | integer | optional | Stop after the Nth failing test. Implies --bail when N=1. |
| --collect-only, --list-tests | flag | optional | Print discovered tests and exit 0 without running them. Honors --filter. Useful for verifying a selector before a long run. |
| --pass-with-no-tests | flag | optional | Treat "zero tests selected" as success. Without this flag, a run that picks nothing (after --filter, --shard, or --last-failed) exits 7. |
| --shard <INDEX/TOTAL> | string | optional | Run a deterministic slice of the discovered tests. One-based; --shard 1/3 runs the first third. The partition is stable across runs. Pair with --pass-with-no-tests so a worker with an empty slice does not fail the CI matrix. |
| --last-failed, --lf | flag | optional | Run only the tests that failed on the previous invocation. Reads .mcptest/last-run.json (rewritten after every run). |
| --failed-first, --ff | flag | optional | Reorder the test list so previous failures run first. Every selected test still runs. Mutually exclusive with --last-failed. |
| --print-config | flag | optional | Print the resolved suite (servers, providers, budget, selected tests) and exit. Provider API key env vars are listed by name, never resolved values. |
| --tag <NAME> | string (repeatable) | optional | Run only tests whose YAML tags: list contains NAME. Multiple --tag flags are OR'd. |
| --skip-tag <NAME> | string (repeatable) | optional | Drop tests whose tags: list contains NAME. Applied after --tag so a test matching both is dropped. |
| --random | flag | optional | Shuffle the test order to surface hidden ordering dependencies. The seed is logged so you can reproduce with --seed N. Conflicts with --failed-first. |
| --seed <N> | integer | optional | Pin the shuffle seed for --random. Implies --random. |
| --changed | flag | optional | Run only tests whose YAML or referenced cassette changed in the git working tree against --changed-base (default origin/main). Outside a git repo the selection is empty, so pair with --pass-with-no-tests. |
| --changed-base <REF> | git ref | optional | Base ref for --changed. Default origin/main. |
| -- <SERVER_ARG>... | trailing args | optional | Any argument after -- is appended to every stdio server's command line. Ignored for HTTP / SSE servers. |
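
The --coverage-threshold spec from the table above is a comma-separated list of dimension=percent pairs. A minimal Python sketch of parsing and gating on it might look like this; the function names and the shape of the `measured` dict are hypothetical, and treating an unmeasured dimension as 0 percent is an assumption.

```python
DIMENSIONS = ("tools", "args", "error_paths", "transports")

def parse_threshold(spec):
    """Parse tools=80,args=60 into {"tools": 80.0, "args": 60.0}."""
    thresholds = {}
    for part in spec.split(","):
        name, sep, value = part.partition("=")
        if not sep or name not in DIMENSIONS:
            raise ValueError(f"bad threshold entry: {part!r}")
        thresholds[name] = float(value)
    return thresholds

def coverage_exit_code(measured, spec):
    """Return 6 when any gated dimension falls below its threshold, else 0."""
    thresholds = parse_threshold(spec)
    missed = [
        name for name, minimum in thresholds.items()
        if measured.get(name, 0.0) < minimum  # assumption: unmeasured counts as 0
    ]
    return 6 if missed else 0
```

Only the dimensions named in the spec are gated, so `tools=80,args=60` ignores error_paths and transports entirely.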
Agent-test env vars. Auto-detected provider families read these at run time. Missing keys are not an error; the runner falls back to a deterministic stub so CI stays green.
| Family | Env var | Notes |
|---|---|---|
| Anthropic | ANTHROPIC_API_KEY | Claude models (claude-*). |
| OpenAI | OPENAI_API_KEY plus optional OPENAI_ORG_ID | gpt-*, chatgpt-*, o<digit>*, text-*, davinci-*. |
| Google | GEMINI_API_KEY (falls back to GOOGLE_API_KEY) | gemini-*. |
| Mistral | MISTRAL_API_KEY | mistral-*, codestral-*, etc. |
Custom OpenAI-compatible endpoints (Azure, OpenRouter, vLLM, LiteLLM, Together, Groq, Anyscale, Fireworks) are declared under top-level providers: in the YAML and reference whatever env var name you give them. See docs/models.md.
Examples.
# Smallest possible invocation; uses ./mcptest.yaml.
mcptest run
# Run a specific suite with JUnit output for CI.
mcptest --config tests/mcp.yaml --reporter junit --output reports/junit.xml run
# Point the suite at a preview deploy and wait for readiness.
mcptest --server-url https://preview-42.example.com \
--wait-for-ready=2m \
run
# Iterate on a single tag locally with debug logging.
mcptest --debug --filter '@smoke' run
# Record coverage and gate the run on a two-dimension threshold.
mcptest run --coverage --coverage-threshold tools=80,args=60
# Record agent cassettes against every key set in the environment.
ANTHROPIC_API_KEY=... OPENAI_API_KEY=... mcptest run --record
# Show which tests the runner would execute without running them.
mcptest --filter weather run --collect-only
# Stop at the first failure (developer's inner loop).
mcptest run --bail
# Iterate on yesterday's failures only; bail on the first one.
mcptest run --last-failed --bail
# CI matrix: three workers each take a stable third of the suite.
mcptest run --shard 1/3 --pass-with-no-tests
# See what the runner would dispatch without connecting to anything.
mcptest run --print-config
# Gate the run on a saved timing baseline.
mcptest run --check-baseline mcptest.timing-baseline.yml --tolerance-pct 25
Timing baseline (--check-baseline). Compares the run against a saved baseline file (written by mcptest baseline update). Tests whose elapsed milliseconds exceed p90 * (1 + tolerance_pct / 100) print a regression line and exit non-zero. The default tolerance is 0 (any overrun is a regression); a busy CI runner may want 25 or 50 to absorb noise. Cache hits and skipped tests are excluded from the check.
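
The regression rule is simple arithmetic, and a worked sketch may help when choosing a tolerance. The Python below is illustrative only: the field names (`elapsed_ms`, `p90`, `cache_hit`, `skipped`) are hypothetical stand-ins for whatever the report and baseline files actually use, and ignoring tests with no baseline entry is an assumption.

```python
def regression_limit_ms(p90_ms, tolerance_pct=0.0):
    """Largest elapsed time that still passes: p90 * (1 + tolerance_pct / 100)."""
    return p90_ms * (1 + tolerance_pct / 100)

def find_regressions(results, baseline, tolerance_pct=0.0):
    """Return the names of tests whose elapsed time exceeds the limit."""
    regressions = []
    for name, row in results.items():
        # Cache hits and skipped tests are excluded from the check.
        if row.get("skipped") or row.get("cache_hit") or name not in baseline:
            continue
        if row["elapsed_ms"] > regression_limit_ms(baseline[name]["p90"], tolerance_pct):
            regressions.append(name)
    return regressions
```

With a p90 of 200 ms, a tolerance of 25 raises the limit to 250 ms, so a 240 ms run regresses at the default tolerance of 0 but passes at 25.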
Exit codes. 0 on success, 1 on any test failure or baseline regression, 2 on configuration error, 3 on --wait-for-ready timeout, 6 on --coverage-threshold miss, 7 on "no tests selected" (use --pass-with-no-tests to treat as success), 124 on a hard runner timeout.
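
A CI wrapper that branches on these codes could encode them as an enum. This is a sketch for wrapper scripts, not something mcptest exports; the `RunExit` names are hypothetical labels for the documented numeric values.

```python
from enum import IntEnum

class RunExit(IntEnum):
    SUCCESS = 0
    TEST_FAILURE = 1      # any test failure or baseline regression
    CONFIG_ERROR = 2
    READY_TIMEOUT = 3     # --wait-for-ready budget exhausted
    COVERAGE_MISS = 6     # --coverage-threshold not met
    NO_TESTS = 7          # nothing selected and no --pass-with-no-tests
    RUNNER_TIMEOUT = 124  # hard runner timeout

def describe(code):
    """Map a process exit code to a short label, or "unknown"."""
    try:
        return RunExit(code).name.lower()
    except ValueError:
        return "unknown"
```

A wrapper might, for instance, retry only on `ready_timeout` and fail fast on everything else.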
baseline
Manage the timing-baseline file the mcptest run --check-baseline gate reads. v1 ships one subcommand:
mcptest baseline update [--samples 20] --from-report run.json BASELINE
update reads a saved canonical JSON run report (mcptest run --reporter json --output run.json produces one) and folds each non-skipped, non-cache-hit test row into the baseline file. When the baseline file does not exist, it is bootstrapped from the report; when it does, the per-test p50 / p90 are blended via a rolling window (--samples, default 20). Tests present in the baseline but absent from this run are preserved so a transient --tag filter does not silently delete them.
Workflow.
# Refresh the baseline from a clean run.
mcptest run --reporter json --output run.json
mcptest baseline update --from-report run.json mcptest.timing-baseline.yml
# Gate subsequent runs against the saved baseline.
mcptest run --check-baseline mcptest.timing-baseline.yml --tolerance-pct 25
conformance
Score a running MCP server against the vendored SEP corpus, refresh the corpus from upstream, or list which check ids the corpus carries. Three subcommands:
mcptest conformance run [FLAGS] # score against the corpus
mcptest conformance refresh [FLAGS] # pull the latest SEPs
mcptest conformance check-ids [FLAGS] # list check ids
The corpus ships baked into the binary at compile time, so a cargo install mcptest user can score offline. The resolution order for the corpus directory is: --corpus-dir if set, else the XDG user cache (~/.cache/mcptest/conformance/), else the embedded fallback. Each subcommand surfaces corpus_source on its report so a reader can tell which path the run used.
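
The corpus resolution order can be sketched as a three-step fallback. In this Python illustration the source labels (`corpus-dir`, `user-cache`, `embedded`) are hypothetical, not the actual corpus_source values, and the sketch assumes the user cache only counts when the directory exists.

```python
from pathlib import Path

DEFAULT_CACHE = Path.home() / ".cache" / "mcptest" / "conformance"

def resolve_corpus(corpus_dir=None, cache_dir=DEFAULT_CACHE):
    """Return (source_label, path); path is None for the embedded fallback."""
    if corpus_dir is not None:
        return "corpus-dir", Path(corpus_dir)  # --corpus-dir always wins
    if cache_dir.is_dir():                     # assumption: must already exist
        return "user-cache", cache_dir
    return "embedded", None                    # baked into the binary
```

Returning the label alongside the path mirrors the idea of surfacing which source a run used.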
conformance run flags.
| Flag | Default | Purpose |
|---|---|---|
| --server <URL> | none | MCP server to probe. Wire-probe integration ships as a follow-up; the v1 report scores from the corpus content only. |
| --target-version <V> | latest available | Which spec revision to score. Defaults to the lexicographically greatest version available locally. |
| --corpus-dir <PATH> | (resolution order) | Override the corpus location. |
| --format <FORMAT> | pretty | One of pretty, json, markdown, html. |
| --out <PATH> | stdout | Where to write the report. |
| --auto-refresh | off | Trigger refresh when the requested version is missing locally. Off by default so a run never silently makes a network call. |
conformance refresh flags.
| Flag | Default | Purpose |
|---|---|---|
| --target-version <V> | latest | Spec version to fetch. latest resolves to the newest entry in mcptest_core::conformance::releases::KNOWN_RELEASES. |
| --corpus-dir <PATH> | user cache | Destination. Never special-cases the in-repo crates/mcptest-core/seps/ path; maintainers pass it explicitly when refreshing the vendored copy. |
| --url <URL> | upstream | Override the upstream repository. |
| --ref <REF> | (from KNOWN_RELEASES) | Pin to a specific tag or SHA. |
| --source-path <PATH> | src/seps | Subdirectory in the upstream tree to mirror. |
| --dry-run | off | Print what would be fetched and where, without writing. |
The refresh transport is an HTTPS GET to codeload.github.com/<owner>/<repo>/tar.gz/<ref>, extracted in memory. Set GITHUB_TOKEN to lift GitHub's 60-req/hr anonymous rate limit.
conformance check-ids flags.
| Flag | Default | Purpose |
|---|---|---|
| --target-version <V> | latest available | Which corpus to inspect. |
| --corpus-dir <PATH> | (resolution order) | Override the corpus location. |
| --missing-only | off | Print only the unimplemented check ids. |
| --format <FORMAT> | pretty | One of pretty or json. |
# Score the embedded corpus and write the JSON envelope.
mcptest conformance run --format json --out report.json
# Refresh the user cache from upstream (no token needed for the
# default rate-limited path).
mcptest conformance refresh
# List the check ids the runner has not implemented yet.
mcptest conformance check-ids --missing-only
init
mcptest [GLOBAL_OPTIONS] init [--with-jury] [--force] [--url URL | --from-discovered NAME]
Description. Scaffold a new mcptest project in the current directory. Creates a starter tests/example.yaml and a mcptest.yaml config. Safe to run inside an empty directory; refuses to overwrite existing files unless --force is supplied. --url forks to a URL-target template; --from-discovered scaffolds from a server found in a local MCP client config (see doctor).
Arguments.
| Argument | Type | Required | Description |
|---|---|---|---|
| --with-jury | flag | optional | Append a v1.0 LLM-judge example to tests/example.yaml. The block is marked as a v1.0 feature in a comment so users understand it is forward-looking. |
| --force | flag | optional | Overwrite existing files. Default behavior is to refuse. |
| --from-discovered <NAME> | string | optional | Scaffold a stdio suite from a server discovered in a local MCP client config (the names mcptest doctor lists). Conflicts with --url. Env-var names are surfaced as a comment; their values are never copied into the scaffold. |
Examples.
# Scaffold a project in the current directory.
mkdir my-mcp-tests && cd my-mcp-tests
mcptest init
# Include the v1.0 jury example.
mcptest init --with-jury
# Scaffold from a server already configured in a local MCP client.
mcptest init --from-discovered github
# Overwrite stale scaffolding.
mcptest init --force
Exit codes. 0 on success, 2 when a target file already exists and --force was not supplied.
Status. Working.
doctor
mcptest [GLOBAL_OPTIONS] doctor [--no-tool-tokens] [--tokenizer NAME]
Description. Diagnose the local environment and server connectivity. Lists which dotenv files were discovered, how many variables came from each source, and (when wired) the cost of the server's tools/list catalog measured in tokens. It also prints a test-readiness inventory of the MCP servers found across local client configs (Claude Desktop, Claude Code, Cursor, VS Code, Windsurf, Codex), showing identity, transport, and presence-of-auth only, with secrets redacted (see server discovery).
Alongside the token total, doctor reports a tool-search posture signal: friendly or heavy. A catalog is friendly when its token cost is at or under 20K and it advertises ten tools or fewer; otherwise it is heavy. The threshold sits well under the roughly 55K-token cost of the five-server MCP setup described in Anthropic's advanced-tool-use guidance. The same guidance notes that real systems reach about 134K tokens and that the Tool Search Tool defers definitions to cut catalog token cost by about 85 to 95 percent. A friendly catalog is cheap enough to load up front; a heavy catalog is large enough that deferred loading (tool search) would pay off.
The pure computations behind doctor (env discovery, tokenizer accounting, posture classification) are fully unit-tested. The live tools/list probe is still pending.
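The posture rule described above can be sketched as a small shell function. This is an illustration only, not the code doctor runs: the name tool_search_posture and its two positional arguments are hypothetical, and 20K is read as 20,000 tokens.

```shell
# Hypothetical helper: classify a catalog as "friendly" or "heavy".
# $1 = catalog token cost, $2 = number of advertised tools.
# Assumes the documented 20K threshold means 20000 tokens.
tool_search_posture() {
  if [ "$1" -le 20000 ] && [ "$2" -le 10 ]; then
    echo friendly
  else
    echo heavy
  fi
}

tool_search_posture 18500 8    # prints: friendly
tool_search_posture 18500 12   # prints: heavy
```

Both conditions must hold for friendly, so a small catalog with many tools is still reported as heavy.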
Arguments.
| Argument | Type | Required | Description |
|---|---|---|---|
--no-tool-tokens | flag | optional | Disable the tool catalog token cost check. Use when the server is unreachable and you want the rest of the doctor report. |
--tokenizer <NAME> | string | optional (default cl100k_base) | Override the tokenizer used for the catalog token cost check. Supported: cl100k_base (GPT-3.5/4), o200k_base (GPT-4o), gpt2, claude (currently aliased to cl100k_base), and whitespace (transport-free approximation). |
--lint-descriptions | flag | in-flight | Run the catalog description quality lint. Not present in the v1.0 binary; it will get its own exit code when the upcoming release ships. |
Examples.
# Default doctor run.
mcptest doctor
# Skip the live tools/list probe (offline triage).
mcptest doctor --no-tool-tokens
# Use a specific tokenizer for the catalog cost report.
mcptest doctor --tokenizer o200k_base
# Combine with --verbose to also see resolver decisions.
mcptest --verbose doctor
Exit codes. 0 when the report renders successfully. 1 is reserved for a doctor probe that fails outright. 7 is reserved for --lint-descriptions failures.
Migration probe (--target-version). Pair --url with --target-version 2026-07-28 to run the migration doctor. It adds a one-shot initialize probe after the regular pipeline and reports one row per breaking change from the migration pair-corpus. v1 detects the deprecated capabilities (Roots / Sampling / Logging); other categories surface as [SKIP] with a one-line rationale and a follow-up ticket reference (stateless transport, schema validator, auth pack). Pair with mcptest lint for the offline YAML and cassette scan. A [FAIL] row gates CI (exit 1).
mcptest doctor --url https://mcp.example.com --target-version 2026-07-28
Status. Working. The tool catalog token check is wired but the live tools/list call is still pending; until then the handler reports that the check is wired but not yet runnable. The tool-search posture signal is computed from the same catalog token cost. --lint-descriptions is in-flight.
validate
mcptest [GLOBAL_OPTIONS] validate
Description. Validate the YAML config against the published JSON Schema (schemas/v1.json). Useful as a pre-commit hook and as the first step in any CI pipeline: it catches typos and structural mistakes before any test runs.
The path to validate is taken from the global --config flag so behavior is consistent with run.
Arguments.
| Argument | Type | Required | Description |
|---|---|---|---|
| (none) | | | The file to validate is taken from --config or ./mcptest.yaml. |
Examples.
# Validate the default ./mcptest.yaml.
mcptest validate
# Validate a specific suite (useful in a multi-suite repo).
mcptest --config tests/integration/mcp.yaml validate
# Run validate as a pre-commit step.
git diff --cached --name-only | grep -q '\.ya\?ml$' && mcptest validate
Exit codes. 0 when the file parses and validates. 2 on a schema violation, broken ${VAR} reference, missing import, or unreadable file. Every finding from every layer is reported in a single pass.
Status. Working.
schema
mcptest [GLOBAL_OPTIONS] schema [--version v1]
Description. Emit the JSON Schema for the YAML config to stdout. Output is byte-equivalent to https://mcptest.sh/schema/v1.json. Use it to wire mcptest into IDEs (VS Code's YAML extension, IntelliJ) so authors get autocomplete and inline validation while they type.
Arguments.
| Argument | Type | Required | Description |
|---|---|---|---|
--version <VERSION> | string | optional, default v1 | Schema version to emit. Only v1 is shipped today; a future v2 will land as a separate match arm. |
Examples.
# Print the schema to stdout.
mcptest schema
# Pipe it into a tool that consumes JSON Schema.
mcptest schema > .vscode/mcptest.schema.json
# Validate ad-hoc YAML against the schema using a third-party tool.
mcptest schema | check-jsonschema --schemafile - my-tests.yaml
Exit codes. 0 on success. 2 when --version names an unknown schema revision.
Status. Working.
coverage
mcptest [GLOBAL_OPTIONS] coverage [--threshold PERCENT] [--format FORMAT]
Description. Compute per-tool and per-resource coverage metrics for the server's surface. Reports which tools and resources were exercised by the test suite and which were skipped, so authors can spot dead corners of their server.
The pure computation in mcptest_core::coverage is fully unit-tested. Running it requires a runner that records which tools and resources were exercised at execution time.
Arguments.
| Argument | Type | Required | Description |
|---|---|---|---|
--threshold <PERCENT> | float, 0 to 100 | optional | Quality gate threshold as a percentage. When set and the computed coverage is below the threshold, the runner exits with code 6. |
--format <FORMAT> | enum {pretty, json} | optional, default pretty | Output format for the coverage report. pretty renders a human-friendly table; json emits the structured CoverageReport. |
Examples.
# Local check after a test run, pretty output.
mcptest coverage
# Hard gate at 80% coverage in CI.
mcptest coverage --threshold 80
# Machine-readable JSON for downstream tooling.
mcptest coverage --format json > coverage.json
# Combine with --filter to scope coverage to a subset.
mcptest --filter '@public' coverage --threshold 90
Exit codes. 0 when coverage is computed (and meets the threshold when one is set). 6 when --threshold is set and the report does not meet it.
Status. Stub in v1.0. The handler prints the requested threshold and format, then exits 0. Live wiring lands when the runner records exercised tools and resources.
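Until the live wiring lands, the documented gate semantics can be approximated in a CI script. The sketch below assumes a coverage percentage is already available from some other source; coverage_gate is a hypothetical helper, and awk handles the comparison because POSIX sh arithmetic is integer-only.

```shell
# Hypothetical helper mirroring the documented --threshold semantics.
# $1 = computed coverage percentage, $2 = threshold percentage (0 to 100).
# Returns 6 when coverage is below the threshold, 0 otherwise.
coverage_gate() {
  if awk -v c="$1" -v t="$2" 'BEGIN { exit !(c + 0 < t + 0) }'; then
    return 6
  fi
  return 0
}
```

Coverage equal to the threshold passes, matching the "below the threshold" wording in the table.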
report
mcptest [GLOBAL_OPTIONS] report <INPUT> [--format FORMAT] [--output PATH]
Description. Re-render a previously-saved JSON run as another reporter format. Saves a re-run when CI already captured the canonical JSON but a different consumer (a PR comment, a GitHub job summary, a SARIF importer) wants a different shape.
The input is the JSON written by mcptest run --output run.json --reporter json or any equivalent invocation. The redaction policy is re-applied at the dispatch site so every output shape shares one redacted view.
Arguments.
| Argument | Type | Required | Description |
|---|---|---|---|
<INPUT> | filesystem path | yes | Path to a JSON report previously written by mcptest run. |
--format <FORMAT> | enum, see below | optional, default pretty | Reporter format to render in. |
--output <PATH> | filesystem path | optional | Write to this file instead of stdout. |
Accepted --format values:
| Format | Description |
|---|---|
pretty | Human-friendly text output (default). |
json | Pretty-printed JSON, round-trippable through the same model. |
junit | JUnit XML suitable for dorny/test-reporter and CircleCI Insights. |
md | GitHub-flavored Markdown for PR comments and job summaries. |
html | Single-file HTML report with inline CSS. |
sarif | SARIF 2.1.0 for GitHub code-scanning and similar consumers. See sarif-reporter.md. |
gitlab | GitLab Code Quality JSON for merge request widgets. See gitlab-code-quality.md. |
ndjson | Newline-delimited JSON: one test record per line, then a summary. For log pipelines and jq -c. |
tap | Test Anything Protocol v14, for prove/tappy-style consumers. |
matrix | Self-contained HTML test-by-model comparison grid. See matrix-reporter.md. |
matrix-md | The comparison grid as GitHub-flavored Markdown. |
upload | HTTPS upload of the canonical run envelope to --upload-endpoint (preview). |
Examples.
# Re-render a saved run as JUnit XML for CI.
mcptest report run.json --format junit --output junit.xml
# Produce a Markdown summary for a PR comment.
mcptest report run.json --format md --output pr-summary.md
# Generate an HTML report for sharing in Slack.
mcptest report run.json --format html --output run.html
# Ship a run envelope to a collector.
mcptest --upload-endpoint https://collector.example.com/v1/runs \
--upload-token-env COLLECTOR_TOKEN \
--upload-organization acme \
report run.json --format upload
# Print the round-trippable JSON to stdout.
mcptest report run.json --format json
Exit codes. 0 on a successful render or upload. 1 when the input file is missing or malformed, or when the upload CLI is misconfigured (no endpoint, bad URL). 2 when the collector returns an error or declines the upload.
Status. Working. Every format above is wired. The upload format is documented as a preview because the wire envelope schema is not yet finalized.
eval
mcptest [GLOBAL_OPTIONS] eval [--max-cost USD] [--no-verdict-cache] [--explain]
Description. Run quality evaluations against an MCP server using an LLM judge. v1.0 ships single-judge mode: every entry in the evals: block is graded by one juror, the verdict and rationale are pretty-printed, and a cost budget tracks cumulative spend. Multi-juror consensus, bias mitigations, and inter-juror agreement are not part of single-judge mode and follow in a later release.
Arguments.
| Argument | Type | Required | Description |
|---|---|---|---|
--max-cost <USD> | float | optional | Hard ceiling in USD across every LLM-judge call. Accepts an optional leading $. The runner stops dispatching new tests once cumulative spend would exceed the cap. |
--no-verdict-cache | flag | optional | Disable the LLM-judge verdict cache for this run. Overrides evals.cache.verdicts: true in YAML. |
--explain | flag | optional | Print what each eval would grade (rubric, candidate source, judge model, judge-call count) without calling any provider or spending tokens, then exit. |
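The --max-cost rule (stop dispatching once cumulative spend would exceed the cap) can be illustrated with a short sketch. It assumes per-call costs are known up front, which the real runner may not have; calls_within_cap is a hypothetical helper, not part of mcptest.

```shell
# Hypothetical helper: count how many judge calls fit under a cap.
# $1 = cap in USD (an optional leading $ is stripped),
# remaining args = per-call costs in USD, in dispatch order.
calls_within_cap() {
  cap=${1#\$}
  shift
  [ $# -gt 0 ] || { echo 0; return 0; }
  printf '%s\n' "$@" | awk -v cap="$cap" '
    # Stop before the first call that would push spend past the cap.
    { if (spent + $1 > cap) exit; spent += $1; n++ }
    END { print n + 0 }
  '
}
```

With a cap of $1.00 and three calls at 0.40 each, two calls are dispatched and the third is held back.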
Examples.
# Run every eval in mcptest.yaml.
mcptest eval
# Cap total LLM-judge spend at one dollar. Quote the leading $ so the shell does not expand $1.
mcptest eval --max-cost '$1.00'
# Force fresh verdicts even when the YAML opts into caching.
mcptest eval --no-verdict-cache
Exit codes. 0 on success. 1 on a failed evaluation. 5 when the configured cost cap is exceeded.
Status. Working in single-judge mode. Multi-juror consensus follows in a later release.
diff
mcptest [GLOBAL_OPTIONS] diff <OLD> <NEW> [--format FORMAT] [--fail-on-breaking BOOL] [--scorecard]
Description. Diff two saved tools/list JSON snapshots and report which tools were added, removed, or reshaped, flagging each change as breaking or non-breaking so a CI job can fail loudly on a real regression. The --scorecard flag appends a release grade summarizing the diff.
Arguments.
| Argument | Type | Required | Description |
|---|---|---|---|
<OLD> | filesystem path | yes | Path to the old snapshot (the baseline). |
<NEW> | filesystem path | yes | Path to the new snapshot (the candidate). |
--format <FORMAT> | enum {pretty, json, markdown} | optional, default pretty | Output format for the diff report. |
--fail-on-breaking <BOOL> | boolean | optional, default true | Exit with code 1 when at least one change is breaking. Set to false for advisory CI output rather than a hard gate. |
--scorecard | flag | optional, default off | Append a release scorecard (A+ / A / B / C / D / F letter grade plus per-tool added / removed / regressed callouts) to the diff output. |
Scorecard grading. Aggregates the diff into one letter grade:
| Grade | Trigger |
|---|---|
A+ | No changes at all. Identical snapshots. |
A | At least one safe change, no breaking changes. |
B | Exactly one breaking change. |
C | Two or three breaking changes. |
D | Four or more breaking changes. |
F | Any tool removed between old and new (the most disruptive change a server can ship). |
The grade matches the spirit of the compliance grade table so the two scorecards line up in marketing material.
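The grade table can be read as a simple decision ladder. The sketch below is one plausible reading, not the engine in mcptest_core::diff: scorecard_grade is a hypothetical helper, and it assumes F takes precedence over B, C, and D whenever a tool was removed.

```shell
# Hypothetical helper mirroring the grade table.
# $1 = safe (non-breaking) changes, $2 = breaking changes, $3 = removed tools.
# Assumption: any removal yields F regardless of the other counts.
scorecard_grade() {
  if [ "$3" -gt 0 ]; then echo F
  elif [ "$2" -ge 4 ]; then echo D
  elif [ "$2" -ge 2 ]; then echo C
  elif [ "$2" -eq 1 ]; then echo B
  elif [ "$1" -gt 0 ]; then echo A
  else echo A+
  fi
}
```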
Examples.
# Local diff against the previous saved snapshot, pretty output.
mcptest diff snapshots/old.json snapshots/new.json
# Hard fail in CI when any change is breaking.
mcptest diff snapshots/main.json snapshots/pr.json --fail-on-breaking true
# Advisory output for a PR comment (does not fail the job).
mcptest diff snapshots/main.json snapshots/pr.json \
--format markdown \
--fail-on-breaking false > pr-comment.md
# Machine-readable JSON for downstream tooling.
mcptest diff snapshots/main.json snapshots/pr.json --format json
Exit codes. 0 when there are no breaking changes (or --fail-on-breaking false is set). 1 when --fail-on-breaking is true and at least one change is breaking, or when a snapshot file is missing or malformed.
Status. Working. The diff engine lives in mcptest_core::diff; the CLI handler loads the JSON snapshots, calls diff_tool_catalogs, and renders the report.
lint
mcptest [GLOBAL_OPTIONS] lint [PATH...] [--format pretty|json] [--no-fail]
Scan YAML suites and JSON cassettes for usage of features the MCP 2026-07-28 spec deprecates. One-shot, offline: no server is contacted. Walks every *.yaml / *.yml / *.json file under each PATH (defaults to the current directory) and emits one finding per hit. target/, node_modules/, and .git/ are skipped.
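To preview which files such a walk would visit, the selection rule can be approximated with find. This is a sketch of the rule as stated here, not the walker mcptest uses; lint_candidates is a hypothetical helper.

```shell
# Hypothetical helper: list the files a lint walk would consider.
# Arguments are the paths to walk; defaults to the current directory.
lint_candidates() {
  [ $# -gt 0 ] || set -- .
  find "$@" \
    \( -type d \( -name target -o -name node_modules -o -name .git \) -prune \) -o \
    \( -type f \( -name '*.yaml' -o -name '*.yml' -o -name '*.json' \) -print \)
}
```

The -prune branch stops find from descending into the three skipped directories, so a package.json under node_modules/ never appears in the list.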
Detected patterns:
| Kind | Trigger | Replacement |
|---|---|---|
roots | roots/list method | Tool parameters, resource URIs, or server configuration. |
sampling | sampling/createMessage method | Direct integration with LLM provider APIs. |
logging | logging/setLevel or notifications/message method | stderr for stdio servers; OpenTelemetry for structured observability. |
tasks-list | tasks/list method (removed in the Tasks extension lifecycle) | tasks/get polling on a server-issued task handle. |
legacy-error-code | Literal -32002 error code | Standard JSON-RPC -32602 Invalid Params code. |
Exit code. 0 when no findings land or with --no-fail. 1 when any finding is reported. Useful as a CI gate.
Example.
mcptest lint examples/ docs/
mcptest lint --format json suites/ > deprecations.json
mcptest lint --no-fail . # advisory mode
Status. Working. The migration doctor will add a live probe that complements this offline scan.
migrate
mcptest [GLOBAL_OPTIONS] migrate [PATH...] [--to 2026-07-28] [--write]
Rewrite YAML suites toward an MCP spec revision. v1 covers the 2026-07-28 target only; any other --to value is a clear error. Two kinds of rewrite ship:
- Annotation. Every deprecated-feature hit gets a # TODO(mcptest-migrate) comment inserted immediately above the offending line, pointing at the replacement guidance from the migration corpus. The original line is left intact so the file still parses and so the operator can apply the human-judgement rewrite.
- Mechanical rewrite. The legacy -32002 JSON-RPC error code has a safe one-to-one replacement (-32602 Invalid Params). The migrator annotates the line with a TODO and substitutes the literal token.
The same deprecation catalog mcptest lint uses drives the migrator, so a run that lints clean migrates as a no-op.
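The mechanical rewrite can be pictured as a stream filter. The awk sketch below is an approximation, not the migrator itself: migrate_error_code is a hypothetical helper, the TODO wording is invented for illustration, and it matches -32002 as a plain substring rather than as a parsed YAML token.

```shell
# Hypothetical helper: annotate and rewrite the legacy error code on stdin.
# The comment text after "TODO(mcptest-migrate):" is illustrative only.
migrate_error_code() {
  awk '
    /-32002/ {
      indent = $0
      sub(/[^ \t].*$/, "", indent)   # keep only the leading whitespace
      print indent "# TODO(mcptest-migrate): legacy error code rewritten to -32602 (Invalid Params)"
      gsub(/-32002/, "-32602")
    }
    { print }
  '
}
```

Lines without the legacy code pass through unchanged, so running the filter over its own output changes nothing.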
Flags.
| Flag | Default | Description |
|---|---|---|
--to <VERSION> | 2026-07-28 | Target spec revision. v1 supports 2026-07-28 only. |
--write | off | Apply the rewrites in place. Default is dry-run (print the per-file action plan). |
Exit code. 0 on a successful run (writes applied or dry-run completed). 2 if --to names an unsupported target version.
Example.
mcptest migrate examples/ # dry-run, prints what would change
mcptest migrate --write suites/ # apply the rewrites in place
Status. Working: YAML annotation plus the legacy error-code rewrite. Cassette rewrites land in a later release alongside the streamable-HTTP transport rerouting.
discover
mcptest [GLOBAL_OPTIONS] discover <SERVER> [--output PATH] [--bearer-token-env NAME]
Description. Connect to an MCP server, run the initialize handshake, and call tools/list, resources/list, and prompts/list. Pretty-prints the discovered capabilities to stderr and writes a starter tests.yaml with one smoke test per tool.
Arguments.
| Argument | Type | Required | Description |
|---|---|---|---|
<SERVER> | URL or name=url pair | yes | MCP server endpoint. Bare URLs are labelled discovered; name=https://... lets you pick a friendlier server name in the scaffolded YAML. |
--output <PATH> | filesystem path | optional, default tests.yaml | Path for the scaffolded suite. |
--bearer-token-env <NAME> | string | optional | Read a bearer token from this env var and send it as Authorization: Bearer <value> during the probe. |
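The two accepted shapes of <SERVER> can be told apart with a small case statement. This sketch assumes bare URLs start with http:// or https://, which this page does not state; parse_server_arg is a hypothetical helper.

```shell
# Hypothetical helper: split a <SERVER> argument into "name url".
# Bare URLs get the documented default label "discovered".
# The scheme check runs first so a "=" in a query string is not misread.
parse_server_arg() {
  case "$1" in
    http://*|https://*) printf '%s %s\n' discovered "$1" ;;
    *=*)                printf '%s %s\n' "${1%%=*}" "${1#*=}" ;;
    *)                  return 1 ;;
  esac
}
```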
Examples.
# Scaffold against a local server.
mcptest discover http://localhost:8080/mcp
# Scaffold against an authenticated server, into a custom path.
MCP_TOKEN=abc mcptest discover https://api.example.com/mcp \
--bearer-token-env MCP_TOKEN \
--output suites/example.yaml
Exit codes. 0 on success, 1 when the handshake fails or the server is unreachable.
Status. Working.
completions
mcptest [GLOBAL_OPTIONS] completions <SHELL>
Description. Emit a shell completion script for the chosen shell. Lives in this subcommand rather than as a flag so users can pipe straight into a shell init file.
Arguments.
| Argument | Type | Required | Description |
|---|---|---|---|
<SHELL> | enum {bash, zsh, fish, powershell, elvish} | yes | Shell to emit completions for. Backed by clap_complete::Shell. |
Examples.
# Bash, system-wide.
mcptest completions bash | sudo tee /etc/bash_completion.d/mcptest > /dev/null
# Zsh, per-user.
mcptest completions zsh > ~/.zsh/completions/_mcptest
# Fish, per-user.
mcptest completions fish > ~/.config/fish/completions/mcptest.fish
# PowerShell.
mcptest completions powershell | Out-String | Invoke-Expression
Exit codes. 0 on success. clap rejects unknown shells with its standard error path (exit code 2).
Status. Working. The handler is a single call into clap_complete.
model-compat
mcptest [GLOBAL_OPTIONS] model-compat <SUBCOMMAND>
Description. Capture, diff, and replay model-compatibility baselines. The v1.0 headline workflow: snapshot the suite against a model, then compare a later run against that snapshot to classify every assertion as PASS, DRIFT, or FAIL.
Subcommands.
| Subcommand | Purpose |
|---|---|
capture | Write a baseline JSON file for the given model. |
diff | Compare two saved baselines and render the result. |
run | Re-run the suite against a candidate and diff against the baseline. |
mcptest model-compat capture.
mcptest model-compat capture --model ID --output PATH --input PATH [--filter GLOB]
| Argument | Type | Required | Description |
|---|---|---|---|
--model <ID> | string | yes | Provider-qualified model identifier (for example anthropic:claude-sonnet-4.5). |
--output <PATH> | filesystem path | yes | Destination for the baseline JSON. |
--input <PATH> | filesystem path | yes | Source baseline file from the runner. The live runner integration is a follow-up ticket; for v1.0 this flag lets capture ride on a pre-built baseline so the CLI surface is exercised end to end. |
--filter <GLOB> | string | optional | Narrow the captured assertion list with a *-style glob. |
mcptest model-compat diff.
mcptest model-compat diff <BASELINE> <CANDIDATE> [--format FORMAT] [--filter GLOB]
| Argument | Type | Required | Description |
|---|---|---|---|
<BASELINE> | filesystem path | yes | The baseline (left-hand side). |
<CANDIDATE> | filesystem path | yes | The candidate (right-hand side). |
--format <FORMAT> | enum {pretty, json} | optional, default pretty | Output format. JSON renders the full BaselineDiff. |
--filter <GLOB> | string | optional | Narrow which assertion ids appear in the diff. |
mcptest model-compat run.
mcptest model-compat run --baseline PATH --model ID --candidate PATH
| Argument | Type | Required | Description |
|---|---|---|---|
--baseline <PATH> | filesystem path | yes | Saved baseline to diff against. |
--model <ID> | string | yes | Candidate model identifier; printed in the report header. |
--candidate <PATH> | filesystem path | yes | Pre-captured candidate baseline. The live runner lands in a follow-up ticket; for v1.0 this flag lets run exercise the PASS/DRIFT/FAIL exit handling. |
Examples.
# Capture a baseline against the current production model.
mcptest model-compat capture \
--model anthropic:claude-sonnet-4.5 \
--output baselines/sonnet-4.5.json \
--input artifacts/last-run.json
# Compare two saved baselines as JSON for a CI gate.
mcptest model-compat diff baselines/sonnet-4.5.json baselines/sonnet-5.0.json --format json
# Run the suite against a candidate and exit per the rubric.
mcptest model-compat run \
--baseline baselines/sonnet-4.5.json \
--model anthropic:claude-sonnet-5.0 \
--candidate artifacts/sonnet-5.0.json
Exit codes. 0 PASS (every assertion classified PASS). 6 DRIFT (at least one DRIFT, no FAIL). 1 FAIL (any invariant violated or assertion missing).
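The precedence among the three outcomes can be sketched as follows. model_compat_exit_code is a hypothetical helper, and it assumes the caller has already classified a missing assertion as FAIL.

```shell
# Hypothetical helper: derive the exit code from per-assertion classifications.
# Arguments are the classifications, each one of PASS, DRIFT, or FAIL.
model_compat_exit_code() {
  code=0
  for c in "$@"; do
    case "$c" in
      FAIL)  return 1 ;;   # any FAIL wins immediately
      DRIFT) code=6 ;;     # remembered unless a FAIL shows up later
    esac
  done
  return "$code"
}
```

A CI job can branch on the result: treat 6 as a warning that needs review and 1 as a hard failure.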
Status. Working. Library entry points live in mcptest_core::model_compat; the CLI dispatches to commands::model_compat.
compliance
mcptest [GLOBAL_OPTIONS] compliance <SUBCOMMAND>
Description. Score an MCP server against the compliance rubric and render the result in one of four formats. The score is the same ComplianceScore from mcptest_core::compliance::scoring; the four renderers in mcptest_core::compliance::renderers (pretty, JSON, Markdown, HTML) own the presentation. When a baseline is supplied the four BaselineDecision outcomes drive the exit code so CI can stay green while a known set of MUSTs is still pending.
Subcommands.
| Subcommand | Purpose |
|---|---|
run | Run the compliance corpus and render the score. |
invariants | Evaluate spec-derived conformance invariants over a captured session, plus multi-server composition-safety checks. |
mcptest compliance run.
mcptest compliance run \
--results-from PATH \
[--format FORMAT] \
[--server-label LABEL] \
[--registry PATH] \
[--capabilities LIST] \
[--baseline PATH | --expected-failures PATH] \
[--update-baseline] [--yes]
| Argument | Type | Required | Description |
|---|---|---|---|
--results-from <PATH> | filesystem path | yes | JSON list of CheckResult records produced by the runner. The live runner integration lands in a follow-up ticket; for v1.0 this flag lets the score and reporter surface ride on a pre-built artifact. |
--format <FORMAT> | enum {pretty, json, markdown, html} | optional, default pretty | Output format. |
--server-label <LABEL> | string | optional | Label printed in the report header. Falls back to the global --server-url when omitted. |
--registry <PATH> | filesystem path | optional, default compliance/registry.yml | Path to the rule registry YAML. |
--capabilities <LIST> | comma list | optional | Capabilities the server declared during initialize (for example tools,resources). Drives section applicability. |
--baseline <PATH> | filesystem path | optional | Baseline file listing expected failures. Layers the four BaselineDecision outcomes on top of the run. |
--expected-failures <PATH> | filesystem path | optional | Alias for --baseline. |
--update-baseline | flag | optional | Rewrite the baseline file from the current run after a confirmation prompt. Requires --baseline (or --expected-failures). |
--yes | flag | optional | Skip the confirmation prompt for --update-baseline. |
Exit codes. Without a baseline: 0 when no MUSTs failed, 1 when at least one did. With a baseline:
| Decision | Meaning | Exit |
|---|---|---|
NormalPass | Check passed and is not on the baseline. | 0 |
ExpectedFailure | Check failed but is on the baseline. | 0 |
NewRegression | Check failed and is NOT on the baseline. | 1 |
StaleBaseline | Check passed but is still on the baseline. | 1 |
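The decision table reduces to two inputs per check: did it pass, and is it listed in the baseline. The sketch below uses hypothetical helper names and assumes the run exits 1 as soon as any single check lands on a row whose exit is 1.

```shell
# Hypothetical helper: map one check to its BaselineDecision.
# $1 = "pass" or "fail"; $2 = "yes" if the check is on the baseline, else "no".
baseline_decision() {
  case "$1:$2" in
    pass:no)  echo NormalPass ;;
    fail:yes) echo ExpectedFailure ;;
    fail:no)  echo NewRegression ;;
    pass:yes) echo StaleBaseline ;;
    *)        return 2 ;;
  esac
}

# Hypothetical helper: aggregate decisions into the run's exit code.
baseline_exit_code() {
  for d in "$@"; do
    case "$d" in
      NewRegression|StaleBaseline) return 1 ;;
    esac
  done
  return 0
}
```

StaleBaseline failing the run is what keeps the baseline honest: once a check starts passing, its entry has to be removed.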
Examples.
# Render the score as Markdown for a PR comment.
mcptest compliance run \
--results-from artifacts/compliance.json \
--format markdown \
--capabilities tools,resources
# Gate CI on a baseline so known failures stay green.
mcptest compliance run \
--results-from artifacts/compliance.json \
--baseline compliance-baseline.yml
# Regenerate the baseline after a deliberate cleanup.
mcptest compliance run \
--results-from artifacts/compliance.json \
--baseline compliance-baseline.yml \
--update-baseline --yes
mcptest compliance invariants.
mcptest compliance invariants --capture PATH [--format FORMAT]
| Argument | Type | Required | Description |
|---|---|---|---|
--capture <PATH> | filesystem path | yes | JSON capture file. A single session object runs the per-server invariants; an array of sessions also runs the multi-server composition-safety checks. |
--format <FORMAT> | enum {pretty, json} | optional, default pretty | Output format. |
The invariants are the INV-NNN family (handshake ordering, capability attestation, tool result-shape, JSON-RPC error envelopes). With two or more sessions the command also checks tool-namespace overlap and shared-transport id collisions. The capture is read from disk so the run is deterministic and contacts no server. Exit 0 when every invariant passes with no composition hazard, 1 otherwise. See docs/conformance-invariants.md.
Status. Working. Library entry points live in mcptest_core::compliance; the CLI dispatches to commands::compliance.
pipe
mcptest pipe <PIPELINE> [--url URL] [--bearer-token-env NAME]
[--var KEY=VALUE ...] [--format pretty|json]
[--dry-run [--estimate-cost]]
[--max-cost USD] [--max-tokens N]
[--max-cost-per-call USD] [--max-duration DURATION]
[--on-budget-exceeded stop|continue|warn]
[--pricing-table PATH]
Description. Run a declarative multi-step tool-call pipeline. Each step calls a tool, extracts values from earlier steps by reference, and binds them into later steps. The full pipeline YAML grammar, the reference expression language (${step.field}, ${var.X}, ${env.X}), the when: guard, on_error, and the cumulative budget controls are documented in reference/pipe.md.
Arguments.
| Argument | Type | Required | Description |
|---|---|---|---|
<PIPELINE> | path | yes | The pipeline YAML file. |
--url <URL> | string | optional | MCP server the pipeline's tool calls target. |
--bearer-token-env <NAME> | string | optional | Env var holding a bearer token for every request. |
--var <KEY=VALUE> | repeatable | optional | Inject a variable referenced as ${var.KEY}. |
--format <pretty|json> | enum | optional | pretty prints the last step's result; json prints the full execution trace. |
--dry-run | flag | optional | Print the planned execution without making any tool calls. |
--estimate-cost | flag | optional | With --dry-run, also print a projected cost estimate. |
--max-cost <USD> | float | optional | Aggregate USD ceiling across all steps. |
--max-tokens <N> | integer | optional | Aggregate token ceiling (input + output) across all steps. |
--max-cost-per-call <USD> | float | optional | Per-step USD ceiling, layered on the aggregate cap. |
--max-duration <DURATION> | duration | optional | Wall-clock ceiling (30s, 2m, 1h). |
--on-budget-exceeded <MODE> | enum | optional | stop (default) fails fast, continue runs to completion then exits non-zero, warn runs and exits 0. |
--pricing-table <PATH> | path | optional | Override the bundled pricing.yaml used for cost estimates. |
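When a wrapper script needs to compare --max-duration values, the three shapes shown in the table (30s, 2m, 1h) convert to seconds easily. This sketch covers only those suffixes; whether mcptest accepts other units is not stated here, and duration_seconds is a hypothetical helper.

```shell
# Hypothetical helper: convert a duration such as 30s, 2m, or 1h to seconds.
# Only whole numbers with an s, m, or h suffix are handled.
duration_seconds() {
  n=${1%[smh]}
  case "$n" in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-integer amounts
  esac
  case "$1" in
    *s) echo "$n" ;;
    *m) echo $((n * 60)) ;;
    *h) echo $((n * 3600)) ;;
    *)  return 1 ;;            # no recognized unit suffix
  esac
}
```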
Example.
mcptest pipe examples/pipe-search-then-update.yml \
--url http://localhost:8000/mcp --var USER_QUERY=alice --max-cost 0.50
Status. Working (pipelines and budgets).
tools, resources, prompts, capabilities
mcptest tools --url URL [--bearer-token-env NAME] [--format pretty|json]
mcptest tools call <TOOL> --url URL [--args JSON | --args-from-stdin | --arg NAME=$.path ...]
[--bind NAME=$.path ...] [--then TOOL [--then-arg NAME=VALUE|NAME=:bound]...]
[--select $.path] [--json] [--max-cost USD]
mcptest resources --url URL [--format pretty|json]
mcptest prompts --url URL [--format pretty|json]
mcptest capabilities --url URL [--format pretty|json]
Description. Introspect a live server's catalog. The bare tools / resources / prompts forms list the catalog; capabilities prints the initialize capability block. tools call <TOOL> runs one tool imperatively with the chaining primitives so a shell pipeline can extract and forward values without reaching for jq.
tools call arguments.
| Argument | Type | Description |
|---|---|---|
<TOOL> | string | Tool to call. |
--url <URL> | string | MCP server endpoint. |
--args <JSON> | string | Literal args object as a JSON string. |
--args-from-stdin | flag | Read the entire args object from stdin JSON. Conflicts with --args. |
--arg <NAME=$.path> | repeatable | Extract a value from stdin JSON by JSONPath and use it as the named arg. |
--bind <NAME=$.path> | repeatable | Capture a value from this call's output for a chained --then step. |
--then <TOOL> | string | Chain a second tool call within the same invocation. |
--then-arg <NAME=VALUE> | repeatable | Argument for the --then step. NAME=:bound references a --bind capture. |
--select <$.path> | string | Project the output by JSONPath before printing. |
--json | flag | Emit JSON instead of pretty output. |
--max-cost <USD> | float | Aggregate USD ceiling across both calls in this invocation. |
Example.
mcptest tools call search --url "$URL" --args '{"query":"alice"}' \
--bind 'user_id=$.results[0].id' \
--then fetch_user --then-arg user_id=:user_id --select '$.email'
Status. Working (introspection and chaining).
inspect
mcptest inspect --url URL [--bearer-token-env NAME]
mcptest inspect -- <command> [args...]
Connect to one MCP server and explore it in an interactive REPL: the terminal sibling of the one-shot tools / resources / prompts / capabilities / discover commands. Target the server either over streamable HTTP (--url) or over stdio (everything after -- is the server command). The REPL reads commands from stdin, so a piped script drives it the same way an interactive session does.
REPL commands (type help to list them in-session):
| Command | Action |
|---|---|
tools / ls | List tools. |
call <tool> [json] | Call a tool with a JSON-object argument (default {}). |
resources | List resources. |
read <uri> | Read a resource. |
prompts | List prompts. |
prompt <name> [json] | Get a prompt. |
capabilities / caps | Show the server capability block. |
notifications / notif | Show notifications received this session. |
discover [path] | Scaffold a tests.yaml from the live catalog (default tests.yaml). |
quit / exit / q | Leave the session. |
Live server activity is surfaced as it arrives. Notifications (logging, progress, list_changed) print with a <- notification prefix. Server-initiated requests are fulfilled automatically so a server that drives the client can be exercised end to end: sampling/createMessage returns a stub assistant message (no real model call), elicitation/* is declined, and roots/list returns an empty list. inspect advertises the matching client capabilities during the handshake so the server knows it may use them.
Example.
# Stdio server
mcptest inspect -- npx -y @modelcontextprotocol/server-everything
# HTTP server, scripted (non-interactive)
printf 'tools\ncall search {"query":"alice"}\nquit\n' \
| mcptest inspect --url "$URL" --bearer-token-env MCP_TOKEN
Status. Working (terminal-only; web viewers are out of scope).
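Sketch (illustrative). Because the REPL reads commands from stdin, a scripted session is just a text file with one command per line. The Python below builds such a script and shows how it could be piped in. build_inspect_script and run_inspect are hypothetical helper names; only the REPL commands and the inspect flags come from this page, and run_inspect assumes mcptest is on PATH.

```python
import json
import subprocess


def build_inspect_script(calls):
    """Render REPL input: list tools, run each call, then quit."""
    lines = ["tools"]
    for tool, args in calls:
        # `call <tool> [json]` takes its JSON object on the same line,
        # so the arguments are serialized compactly with no newlines.
        lines.append(f"call {tool} {json.dumps(args, separators=(',', ':'))}")
    lines.append("quit")
    return "\n".join(lines) + "\n"


def run_inspect(url, token_env, script):
    """Pipe the script into `mcptest inspect` over stdin (not called here)."""
    return subprocess.run(
        ["mcptest", "inspect", "--url", url, "--bearer-token-env", token_env],
        input=script,
        text=True,
        capture_output=True,
        check=False,
    )


script = build_inspect_script([("search", {"query": "alice"})])
print(script, end="")
```

The printed script matches the printf payload in the example above, which makes it easy to keep longer sessions in version control instead of inline shell strings.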
mcp-server
mcptest mcp-server [--workspace PATH] [--enable-writes] [--mcptest-bin PATH]
Description. Run mcptest itself as a stdio MCP server so an MCP-aware agent (Claude Code, Cursor, mcp-inspector) can query runs, cassettes, coverage, and diagnostics from inside the editor. Read tools are always available; --enable-writes adds the run-triggering and cassette-recording tools. The full tool and resource catalog is in mcp-server.md.
Arguments.
| Argument | Type | Description |
|---|---|---|
| --workspace <PATH> | path | Workspace root. Defaults to the current directory. |
| --enable-writes | flag | Enable the write tools (trigger_run, record_cassette). Off by default. |
| --mcptest-bin <PATH> | path | Override the mcptest binary the write tools spawn. |
Status. Working. Register it in an agent config as command: "mcptest", args: ["mcp-server", "--workspace", "."].
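Sketch (illustrative). The Python below renders the registration entry as JSON. The command and args values come from this page; the surrounding mcpServers key and the "mcptest" entry name are assumptions based on the layout several agents use, so check your agent's own documentation for the exact file and key.

```python
import json


def agent_entry(workspace=".", enable_writes=False):
    """Build the command/args pair an agent uses to spawn the stdio server."""
    args = ["mcp-server", "--workspace", workspace]
    if enable_writes:
        # Off by default; adds trigger_run and record_cassette.
        args.append("--enable-writes")
    return {"command": "mcptest", "args": args}


# `mcpServers` and the entry name are assumed, not defined by mcptest.
config = {"mcpServers": {"mcptest": agent_entry()}}
print(json.dumps(config, indent=2))
```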
generate
mcptest generate stubs --url URL [--bearer-token-env NAME]
[--output DIR] [--overwrite] [--stdout] [--check]
mcptest generate suite --from-config FILE [--server-name NAME]
[--models ID,ID,...] [--no-edge] [--no-violation]
[--output PATH | --update PATH]
Description. Scaffold runnable YAML tests from a server's tool catalog. The generators are grouped under one subcommand so that new ones land as siblings.
generate stubs introspects a live server and emits one test stub file per advertised tool.
generate suite synthesizes one self-contained suite document from a server's declared tools: a servers: block, a multi-model matrix placeholder, and three cases per tool (valid arguments, a boundary edge case, and a schema-violation case that expects an error). The emitted YAML validates against schemas/v1.json, so it runs as written.
generate stubs arguments.
| Argument | Type | Description |
|---|---|---|
| --url <URL> | string | MCP server endpoint to introspect. |
| --bearer-token-env <NAME> | string | Env var holding a bearer token forwarded to every request. |
| --output <DIR> | path | Directory the generated YAML is written under. Default tests/tools. |
| --overwrite | flag | Replace existing stub files instead of skipping them. |
| --stdout | flag | Print every stub concatenated to stdout instead of writing to disk. |
| --check | flag | Exit 6 if any generated stub differs from the checked-in file. CI drift detection. |
generate suite arguments.
| Argument | Type | Description |
|---|---|---|
| --from-config <FILE> | path | Read the declared tool list from a file: a tools/list JSON snapshot ({"tools": [...]}), a bare JSON tools array, or a YAML mock manifest (mock_server.tools[]). |
| --server-name <NAME> | string | Server key the generated tests reference, also the servers: entry. Default server. |
| --models <ID,ID,...> | list | Model identifiers for the model_compatibility: matrix. Omit to use the built-in default lineup. |
| --no-edge | flag | Skip the boundary edge case per tool. |
| --no-violation | flag | Skip the schema-violation case per tool. |
| --output <PATH> | path | Write the suite to a file instead of stdout. |
| --update <PATH> | path | Merge into an existing suite, keeping every hand-authored test and appending only tests whose name is new. Mutually exclusive with --output. |
Why a file, not a live connection. Reading tools from a file keeps generation deterministic and CI-reproducible. Live tools/list introspection for generate suite is the planned follow-up; it will reuse the connector behind generate stubs --url.
Status. Working (stubs and suite).
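Sketch (illustrative). The Python below writes a {"tools": [...]} snapshot, one of the input shapes --from-config reads, and assembles the matching generate suite command line. The tool definition is made up and follows the MCP tools/list shape (name, description, inputSchema). write_tools_snapshot and suite_argv are hypothetical helper names, and the snapshot path is a temporary directory chosen for the example.

```python
import json
import tempfile
from pathlib import Path


def write_tools_snapshot(path, tools):
    """Write a `{"tools": [...]}` snapshot for --from-config."""
    text = json.dumps({"tools": tools}, indent=2) + "\n"
    Path(path).write_text(text, encoding="utf-8")


def suite_argv(from_config, output=None, update=None, no_edge=False, no_violation=False):
    """Build the `generate suite` argv; --output and --update are exclusive."""
    if output and update:
        raise ValueError("--output and --update are mutually exclusive")
    argv = ["mcptest", "generate", "suite", "--from-config", str(from_config)]
    if no_edge:
        argv.append("--no-edge")
    if no_violation:
        argv.append("--no-violation")
    if output:
        argv += ["--output", str(output)]
    if update:
        argv += ["--update", str(update)]
    return argv


# A made-up tool definition in the MCP tools/list shape.
tools = [{
    "name": "search",
    "description": "Search users by name.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

snapshot = Path(tempfile.mkdtemp()) / "tools-list.json"
write_tools_snapshot(snapshot, tools)
print(" ".join(suite_argv(snapshot, output="tests/suite.yaml")))
```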
mock
mcptest mock --tools-from PATH
Description. Spawn a YAML-driven stdio mock MCP server. The mock loads its tool catalog from --tools-from and serves it over stdio, so a client integration can be exercised without the real backend. v1.0 ships stdio only.
Arguments.
| Argument | Type | Description |
|---|---|---|
| --tools-from <PATH> | path | A YAML manifest (mock_server.tools[]) or a tools/list baseline JSON snapshot. |
Status. Working. The cassette-driven mock (--cassettes) is a separate draft; see mcptest-mock.md.
exec
mcptest exec --connection-server [--config PATH] [--ipc-version VERSION]
[--no-cache] [--no-cassette] [--record-cassettes]
[--debug-output PATH] [--verbose]
Description. Run mcptest as an IPC co-process for the native SDKs. The SDK (pytest, Vitest, Go, etc.) spawns this command, pipes newline-delimited JSON-RPC over stdin/stdout, and reads canonical responses back. You do not run this by hand; the language SDKs invoke it.
Arguments.
| Argument | Type | Description |
|---|---|---|
| --connection-server | flag | Required mode toggle (the only mode in v1). |
| --config <PATH> | path | mcptest config. Defaults to mcptest.yaml. |
| --ipc-version <VERSION> | string | Pin the IPC envelope version. The dispatcher rejects a version newer than it supports. |
| --no-cache | flag | Disable the cache for this session. |
| --no-cassette | flag | Disable the cassette layer (no replay, no record). |
| --record-cassettes | flag | Record new cassettes for any unmocked call seen this session. |
| --debug-output <PATH> | path | Write a verbatim transcript for SDK debugging. |
| --verbose | flag | Emit envelope counts and lifecycle events on stderr. |
Status. Working.
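Sketch (illustrative). Newline-delimited JSON-RPC means each message is one compact JSON document terminated by a newline. The Python below shows that framing against an in-memory stream. It is a generic JSON-RPC 2.0 illustration, not the mcptest IPC envelope: the method name example/ping and the payload are hypothetical, and the real envelope fields are defined by the SDKs and the dispatcher.

```python
import io
import json


def write_message(stream, message):
    """Write one JSON document per line; json.dumps escapes embedded newlines."""
    stream.write(json.dumps(message, separators=(",", ":")) + "\n")
    stream.flush()


def read_messages(stream):
    """Yield one parsed message per non-empty line."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


# In-memory stand-in for the co-process pipe; the method name is hypothetical.
pipe = io.StringIO()
write_message(pipe, {"jsonrpc": "2.0", "id": 1, "method": "example/ping",
                     "params": {"note": "a\nb"}})
write_message(pipe, {"jsonrpc": "2.0", "id": 1, "result": {}})

pipe.seek(0)
for message in read_messages(pipe):
    print(message.get("method", "response"), message["id"])
```

The newline inside the note value survives the round trip because it is escaped inside the JSON string, so a line break on the wire always marks a message boundary.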
login
mcptest login [SERVER] [--url URL] [--client-id ID] [--all] [--force] [--no-browser]
Description. Interactive OAuth 2.1 + PKCE login that caches a token for later runs. Discovers the IdP from the target's .well-known/oauth-authorization-server, runs the browser flow against a loopback listener, and caches the token (and any Dynamic Client Registration metadata) for subsequent mcptest run invocations.
Arguments.
| Argument | Type | Description |
|---|---|---|
| [SERVER] | string | Named server from mcptest.yml to log in to. Mutually exclusive with --url and --all. Omit for the single configured URL server or an interactive picker. |
| --url <URL> | string | Authenticate against a URL not recorded in mcptest.yml. Conflicts with --all. |
| --client-id <ID> | string | Fallback OAuth client_id for IdPs without a registration_endpoint. Ignored when DCR is available. |
| --all | flag | Log in to every configured URL server in declaration order. Stdio servers are skipped with a warning. |
| --force | flag | Clear the cached token and DCR metadata before running the flow. |
| --no-browser | flag | Print the authorization URL on stdout instead of opening a browser. The loopback listener still accepts the callback. CI and headless escape hatch. |
Status. Working.
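Sketch (illustrative). PKCE binds the authorization request to the token request with a one-time secret: the client sends a hash of the secret (the challenge) to the authorization endpoint and the secret itself (the verifier) to the token endpoint. The Python below shows the standard S256 derivation from RFC 7636. It explains the mechanism only; that mcptest derives its values exactly this way is an assumption, and make_verifier and s256_challenge are hypothetical names.

```python
import base64
import hashlib
import secrets


def _b64url(data):
    """Base64url without padding, as RFC 7636 requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_verifier(nbytes=32):
    """Generate a high-entropy code_verifier (43 characters for 32 bytes)."""
    return _b64url(secrets.token_bytes(nbytes))


def s256_challenge(verifier):
    """code_challenge = BASE64URL(SHA256(ASCII(code_verifier)))."""
    return _b64url(hashlib.sha256(verifier.encode("ascii")).digest())


verifier = make_verifier()
challenge = s256_challenge(verifier)
# The challenge goes in the authorization URL; the verifier is sent later
# with the token request so the server can recompute and compare.
print(len(verifier), len(challenge))
```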
prompt
mcptest prompt [--output PATH]
Description. Print a copy-paste-ready grounding prompt for an LLM assistant writing mcptest YAML. Prints to stdout by default.
Arguments.
| Argument | Type | Description |
|---|---|---|
| --output <PATH> | path | Write the prompt to a file instead of stdout. |
Status. Working.
cache
mcptest cache [--cache-dir PATH] <list|stats|clear|prune>
mcptest cache clear [--older-than DURATION]
Description. Inspect or evict the local cache store. The --cache-dir override points the store at a non-default root and is accepted on every cache subcommand.
Subcommands.
| Subcommand | Description |
|---|---|
| list | List every cached entry with size and age. |
| stats | Print totals plus the hit-rate row. |
| clear [--older-than DURATION] | Remove entries. Without --older-than, removes everything. Duration like 30m, 2h, 7d. |
| prune | Remove entries older than 30 days. CI-friendly alias so scripts do not pick a number. |
Status. Working.
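Sketch (illustrative). The duration strings above pair an integer with a unit suffix. The Python below parses the m, h, and d suffixes shown on this page and applies the result to an age check. Whether mcptest accepts other suffixes, and whether an entry exactly at the boundary counts as older, are not stated here; the strict comparison is an assumption, and parse_duration and is_evictable are hypothetical names.

```python
import re
from datetime import datetime, timedelta, timezone

_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text):
    """Parse strings such as 30m, 2h, or 7d into a timedelta."""
    match = re.fullmatch(r"(\d+)([mhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def is_evictable(created_at, older_than, now):
    """True when the entry is strictly older than the given duration (assumed)."""
    return now - created_at > parse_duration(older_than)


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
entry = datetime(2025, 1, 1, tzinfo=timezone.utc)  # a 30-day-old entry
print(is_evictable(entry, "7d", now))    # True
print(is_evictable(entry, "720h", now))  # False: exactly 720h is not older
```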
security
mcptest [GLOBAL_OPTIONS] security <SNAPSHOT> [--format FORMAT] [--fail-on SEVERITY]
Description. Scan a tools/list-style JSON snapshot with the bundled deterministic security checks and report the findings. No model decides a verdict: every check is a regex or structural predicate over the tool, prompt, and resource definitions, so a finding is reproducible. The first bundled lane is the tool-surface family (SEC-001 through SEC-009); see the security test catalog.
Arguments.
| Argument | Type | Required | Description |
|---|---|---|---|
| <SNAPSHOT> | filesystem path | yes | A JSON snapshot that may carry tools, prompts, and resources arrays. |
| --format <FORMAT> | enum {pretty, json, sarif, html, md} | optional, default pretty | Output format. SARIF 2.1.0 drops into code scanning; html/md emit a reviewer-grade vulnerability report with an OWASP LLM Top 10 coverage table (see security-report.md). |
| --fail-on <SEVERITY> | enum {info, low, medium, high, critical} | optional, default high | Exit code 1 when any finding is at or above this severity. |
Examples.
# Human-readable summary.
mcptest security tools-list.json
# Hard fail in CI on any high or critical finding.
mcptest security tools-list.json --fail-on high
# SARIF for code scanning.
mcptest security tools-list.json --format sarif > security.sarif
Exit codes. 0 when nothing fires at or above --fail-on, 1 when something does, 2 when the snapshot cannot be read or parsed.
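Sketch (illustrative). The --fail-on rule is an ordered-threshold comparison: severities rank from info up to critical, and the command exits 1 when any finding ranks at or above the floor. The Python below restates that rule; gate is a hypothetical name and the finding lists are made up.

```python
# Ascending order, matching the --fail-on enum on this page.
SEVERITIES = ["info", "low", "medium", "high", "critical"]


def gate(finding_severities, fail_on="high"):
    """Return 1 when any finding is at or above the floor, else 0."""
    floor = SEVERITIES.index(fail_on)
    fired = [s for s in finding_severities if SEVERITIES.index(s) >= floor]
    return 1 if fired else 0


print(gate(["low", "medium"]))        # 0: nothing reaches the default floor
print(gate(["low", "critical"]))      # 1: critical is above high
print(gate(["low"], fail_on="info"))  # 1: every finding counts
```

Exit code 2 (unreadable or unparsable snapshot) is outside this rule, since it fires before any finding is evaluated.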
Subcommands.
- security redteam drives the live red-team corpus against a running server (advisory only, never the verdict).
- security import folds an external scanner's output into a unified report (see below).
Status. Working. The engine lives in mcptest_core::security; the active probes and the integrity, namespace, and advisory lanes are tracked under the security-framework epic.
security import
mcptest security import [--sarif FILE]... [--snyk FILE]... [--supplement FILE]... \
[--snapshot FILE] [--advisory] [--format FORMAT] [--fail-on SEVERITY]
Description. mcptest owns the ingest, not the scan. import normalizes the output of a scanner you already run into mcptest's finding vocabulary, deduplicates it against the bundled catalog (an overlapping SEC rule is counted once), and prints one unified report. SARIF 2.1.0 is read with --sarif, Snyk agent-scan JSON with --snyk, and any other JSON shape (a top-level array or a findings array) with --supplement. Each of these three flags is repeatable. With --snapshot, the bundled deterministic lanes also run and the imports fold in beside them. See the external-scanner supplement.
Arguments.
| Argument | Type | Required | Description |
|---|---|---|---|
| --sarif <FILE> | filesystem path, repeatable | one of the three | A SARIF 2.1.0 log. The scanner name is read from its tool driver. |
| --snyk <FILE> | filesystem path, repeatable | one of the three | A Snyk agent-scan ScanPathResult JSON file. |
| --supplement <FILE> | filesystem path, repeatable | one of the three | A generic scanner JSON file. |
| --snapshot <FILE> | filesystem path | optional | A tools/list snapshot to also scan with the bundled lanes. |
| --advisory | flag | optional | Mark every import advisory, so none of it gates. |
| --format <FORMAT> | enum {pretty, json, sarif, html, md} | optional, default pretty | Output format. html/md emit a vulnerability report with OWASP coverage (see security-report.md). |
| --fail-on <SEVERITY> | enum {info, low, medium, high, critical} | optional, default high | Exit 1 when any counted finding is at or above this floor. |
Examples.
# Fold an AgentSeal SARIF file and a Snyk agent-scan JSON file into one report.
mcptest security import \
--sarif examples/security/agentseal.sarif.json \
--snyk examples/security/snyk-agent-scan.json
# Combine an import with a bundled snapshot scan, emitting SARIF.
mcptest security import --sarif scan.sarif \
--snapshot tools-list.json --format sarif > security.sarif
Exit codes. 0 when nothing counted fires at or above --fail-on, 1 when something does, 2 when a file cannot be read or no scanner file is given.
sbom
mcptest [GLOBAL_OPTIONS] sbom [--format FORMAT] [--out PATH] [--verify]
Description. Print the CycloneDX 1.5 Software Bill of Materials that the build script baked into the binary at compile time, list licenses, or verify the embedded blob has not been swapped at runtime. The full guide lives at Software Bill of Materials.
Arguments.
| Argument | Type | Required | Description |
|---|---|---|---|
| --format <FORMAT> | enum {cyclonedx, licenses, names} | optional, default cyclonedx | cyclonedx is the raw embedded JSON; licenses is one line per dep with its SPDX expression; names is one line per dep with just name and version. |
| --out <PATH> | filesystem path | optional | Write the output here instead of stdout. |
| --verify | flag | optional | Re-hash the embedded BOM at runtime, compare to the build-time SHA, exit 0 on match and 2 on mismatch. |
Examples.
# Pipe the BOM into a scanner.
mcptest sbom > mcptest.cdx.json
# Quick license inventory.
mcptest sbom --format licenses
# Confirm the embedded blob has not been tampered with at runtime.
mcptest sbom --verify
Exit codes. 0 on success or successful verification, 2 when --verify detects a hash mismatch.
Status. Working.
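Sketch (illustrative). A CycloneDX JSON BOM lists dependencies under components, each with optional licenses entries in either expression form or single-license form. The Python below walks that structure to build a license inventory from a saved BOM. It is a reader for the raw cyclonedx output, not a reimplementation of --format licenses, whose exact line layout is not specified here; license_lines is a hypothetical name and the sample components are made up.

```python
import json


def license_lines(bom):
    """Return one `name version licenses` line per component."""
    lines = []
    for component in bom.get("components", []):
        found = []
        for entry in component.get("licenses", []):
            if "expression" in entry:      # SPDX expression form
                found.append(entry["expression"])
            elif "license" in entry:       # single-license form
                lic = entry["license"]
                found.append(lic.get("id") or lic.get("name") or "UNKNOWN")
        name = component.get("name", "?")
        version = component.get("version", "?")
        lines.append(f"{name} {version} {', '.join(found) or 'UNKNOWN'}")
    return lines


# Tiny hand-written BOM; a real one comes from `mcptest sbom > mcptest.cdx.json`
# and would be loaded with json.load().
bom = json.loads("""
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {"name": "crate-a", "version": "1.2.3",
     "licenses": [{"expression": "MIT OR Apache-2.0"}]},
    {"name": "crate-b", "version": "0.1.0",
     "licenses": [{"license": {"id": "MIT"}}]},
    {"name": "crate-c", "version": "0.0.1"}
  ]
}
""")
print("\n".join(license_lines(bom)))
```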
evidence
mcptest [GLOBAL_OPTIONS] evidence <REPORT> [--security FILE] [--reproducible] [--out PATH] [--sign]
mcptest [GLOBAL_OPTIONS] evidence verify <EVIDENCE> [--max-age DURATION] [--signature FILE] [--require-signed]
Description. Aggregate a mcptest run --format json report into a portable evidence artifact (server identity, spec version, corpus hash, source provenance, grades, reproducibility), or verify one. --sign reuses the release Sigstore cosign path to attach a detached signature. See portable run evidence.
Arguments (emit).
| Argument | Type | Required | Description |
|---|---|---|---|
| <REPORT> | filesystem path | yes (unless a subcommand) | A serialized mcptest run --format json report. Must carry run metadata. |
| --security <FILE> | filesystem path | optional | A mcptest security --format json report whose severity counts fold into the grades. |
| --reproducible | flag | optional | Mark the run byte-reproducible (the sbom --verify / SOURCE_DATE_EPOCH parity signal). |
| --out <PATH> | filesystem path | optional | Write the artifact here instead of stdout. Required with --sign. |
| --sign | flag | optional | Sign the artifact with cosign sign-blob (keyless, GitHub OIDC), writing <out>.sig and <out>.cert. Requires cosign on PATH. |
Arguments (verify).
| Argument | Type | Required | Description |
|---|---|---|---|
| <EVIDENCE> | filesystem path | yes | The evidence.json artifact to verify. |
| --max-age <DURATION> | duration (720h, 30m) | optional | Reject evidence whose generated_at is older than this. |
| --signature <FILE> | filesystem path | optional | Detached signature; defaults to <evidence>.sig when present. |
| --require-signed | flag | optional | Reject the artifact when it is unsigned. |
Examples.
# Emit an artifact from a run, folding in a security scan.
mcptest evidence run.json --security security.json --reproducible --out evidence.json
# Sign it (needs cosign on PATH).
mcptest evidence run.json --out evidence.json --sign
# Verify: reject stale (>30d), forked, or unsigned evidence.
mcptest evidence verify evidence.json --max-age 720h --require-signed
Exit codes. Emit: 0 on success, 2 when the report cannot be read or carries no metadata (or --sign cannot run). Verify: 0 accepted, 1 rejected (reasons printed), 2 when the artifact cannot be read.
Status. Working. Cryptographic Sigstore verification (Rekor inclusion, certificate identity) is cosign verify-blob's job; evidence verify owns the freshness, commit-ancestry, and signature-presence policy.
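Sketch (illustrative). Two of the policies evidence verify owns, freshness and signature presence, reduce to simple checks. The Python below restates them and collects rejection reasons. It assumes generated_at is a top-level RFC 3339 timestamp string, which this page does not spell out; the commit-ancestry check is omitted, and is_fresh and rejection_reasons are hypothetical names.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(generated_at, max_age, now=None):
    """True when the timestamp is no older than max_age."""
    # Older Python versions do not parse a trailing Z, so normalize it.
    stamp = datetime.fromisoformat(generated_at.replace("Z", "+00:00"))
    if stamp.tzinfo is None:
        stamp = stamp.replace(tzinfo=timezone.utc)  # assume UTC when unmarked
    now = now or datetime.now(timezone.utc)
    return now - stamp <= max_age


def rejection_reasons(evidence, max_age, signed, require_signed, now=None):
    """Collect policy failures; an empty list means the artifact is accepted."""
    reasons = []
    if max_age is not None and not is_fresh(evidence["generated_at"], max_age, now):
        reasons.append("stale: generated_at is older than --max-age")
    if require_signed and not signed:
        reasons.append("unsigned: --require-signed was set")
    return reasons


now = datetime(2025, 3, 1, tzinfo=timezone.utc)
evidence = {"generated_at": "2025-01-01T00:00:00Z"}  # made-up artifact field
print(rejection_reasons(evidence, timedelta(hours=720), signed=False,
                        require_signed=True, now=now))
```

With the sample data the artifact is 59 days old and unsigned, so both reasons are reported, which corresponds to exit code 1 in the real command.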
ledger
mcptest [GLOBAL_OPTIONS] ledger emit <ENVELOPE> [--session-id ID] [--output PATH]
mcptest [GLOBAL_OPTIONS] ledger diff <BASELINE> <ACTUAL> [--max-diff N]
Description. Turn a saved agent run envelope into a session-ledger NDJSON file, or diff an actual ledger against a baseline trajectory. The ledger is the append-only, structured record of the tool calls a run made: one header record, then one tool_call record per call, in call order. See session ledger for the schema and field reference.
Arguments (emit).
| Argument | Type | Required | Description |
|---|---|---|---|
| <ENVELOPE> | filesystem path | yes | A JSON file holding an agent run envelope with a tool_calls array (each entry has name, server, args). This is the shape a single agent test produces in mcptest run --reporter json. |
| --session-id <ID> | string | optional | Session id stamped on every record. |
| --output <PATH> | filesystem path | optional | Write the ledger here instead of stdout. |
Arguments (diff).
| Argument | Type | Required | Description |
|---|---|---|---|
| <BASELINE> | filesystem path | yes | The recorded baseline ledger NDJSON. |
| <ACTUAL> | filesystem path | yes | The fresh ledger to compare. |
| --max-diff <N> | integer | optional (default 0) | Maximum tolerated divergent tool calls. The command exits non-zero once divergences exceed this; 0 requires an exact match. |
Examples.
# Record a baseline trajectory from a saved envelope.
mcptest ledger emit envelope.json --session-id run-42 --output baseline.ndjson
# Gate a fresh run against the baseline in CI (exact match).
mcptest ledger diff baseline.ndjson actual.ndjson --max-diff 0
The diff compares tool calls position by position per agent_id: a different tool at a hop is a remove plus an add, and a matching tool with different params is a param change. For example, a swapped tool at hop 1 counts as two divergences:
- removed hop 1: get_weather
+ added hop 1: delete
ledger diff: 2 divergence(s) exceed --max-diff 0
Exit codes. 0 clean (or divergences within --max-diff), 1 divergences exceed --max-diff, 2 when an input cannot be read.
Status. Working. The schema is owned here; see session ledger for the open-core boundary.
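Sketch (illustrative). The positional comparison described above can be restated in a few lines of Python: group tool calls by agent_id, walk both sequences hop by hop, and report a remove plus an add for a different tool or a param change for the same tool. The record field names used here (type, name, params) and the 1-based hop numbering are assumptions for the example; the session ledger reference defines the real schema. load_calls and diff_ledgers are hypothetical names.

```python
import json
from collections import defaultdict


def load_calls(ndjson_text):
    """Group tool_call records by agent_id, preserving call order."""
    by_agent = defaultdict(list)
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("type") == "tool_call":  # assumed discriminator field
            by_agent[record.get("agent_id")].append(record)
    return by_agent


def diff_ledgers(baseline, actual):
    """Return one line per divergence, comparing hop by hop per agent."""
    divergences = []
    for agent in sorted(set(baseline) | set(actual), key=str):
        base, act = baseline.get(agent, []), actual.get(agent, [])
        for i in range(max(len(base), len(act))):
            b = base[i] if i < len(base) else None
            a = act[i] if i < len(act) else None
            hop = i + 1  # assumed 1-based numbering
            if b and a and b["name"] == a["name"]:
                if b.get("params") != a.get("params"):
                    divergences.append(f"~ changed hop {hop}: {a['name']} params")
                continue
            if b:
                divergences.append(f"- removed hop {hop}: {b['name']}")
            if a:
                divergences.append(f"+ added hop {hop}: {a['name']}")
    return divergences


baseline = load_calls(
    '{"type":"header","session_id":"run-42"}\n'
    '{"type":"tool_call","agent_id":"a","name":"get_weather","params":{"city":"Oslo"}}\n'
)
actual = load_calls(
    '{"type":"header","session_id":"run-43"}\n'
    '{"type":"tool_call","agent_id":"a","name":"delete","params":{}}\n'
)
found = diff_ledgers(baseline, actual)
print("\n".join(found))
print("exit", 1 if len(found) > 0 else 0)  # --max-diff 0: any divergence fails
```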
web-bot-auth
mcptest [GLOBAL_OPTIONS] web-bot-auth directory [--key PATH | --key-env VAR] [--algorithm ALG] [--agent URL]
Description. Emit the .well-known/http-message-signatures-directory JWK Set for a Web Bot Auth signing key. Only the public key is printed; the private key is never written to the output. See Web Bot Auth for the full signing and verification story.
Arguments (directory).
| Argument | Type | Required | Description |
|---|---|---|---|
| --key <PATH> | filesystem path | one of --key/--key-env | PKCS#8 PEM file holding the private signing key. Only the derived public key is emitted. |
| --key-env <VAR> | env var name | one of --key/--key-env | Env var holding the PKCS#8 PEM private key, so the key never appears on the command line. |
| --algorithm <ALG> | ed25519 or rsa-pss | optional (default ed25519) | Signature algorithm. Must match the key type. |
| --agent <URL> | URL | optional | Signature-Agent URL identifying the bot. Recorded in the validated config; it does not appear in the JWK Set itself. |
Examples.
# Emit the public JWK Set for an Ed25519 key.
mcptest web-bot-auth directory --key bot.ed25519
# Read the key from an env var instead of a file.
mcptest web-bot-auth directory --key-env BOT_SIGNING_KEY
Exit codes. 0 on success, non-zero when the key is missing or malformed.
Status. Working.
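Sketch (illustrative). A JWK Set is a JSON object with a keys array, and an Ed25519 public key is encoded per RFC 8037 as an OKP key whose x member is the base64url form of the 32 raw public-key bytes. The Python below builds that shape from raw bytes. It starts from the public key rather than a PKCS#8 PEM, and the members mcptest actually emits (for example a key id) may differ; ed25519_jwk and jwk_set are hypothetical names.

```python
import base64
import json


def ed25519_jwk(public_key):
    """Encode a raw 32-byte Ed25519 public key as an OKP JWK (RFC 8037)."""
    if len(public_key) != 32:
        raise ValueError("an Ed25519 public key is 32 bytes")
    x = base64.urlsafe_b64encode(public_key).rstrip(b"=").decode("ascii")
    return {"kty": "OKP", "crv": "Ed25519", "x": x}


def jwk_set(*public_keys):
    """Wrap one or more public keys in a JWK Set."""
    return {"keys": [ed25519_jwk(key) for key in public_keys]}


# Public key from the RFC 8032 Ed25519 test vectors, used only as sample input.
sample = bytes.fromhex(
    "d75a980182b10ab7d54bfed3c964073a0ee172f3daa62325af021a68f707511a"
)
print(json.dumps(jwk_set(sample), indent=2))
```

Only public material appears in the structure, which is the same property the directory subcommand guarantees for its output.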
Exit codes
mcptest uses a small, stable set of exit codes so CI scripts can react without parsing stdout. Every subcommand documents which codes it can return; this table is the central reference.
| Code | Meaning | Source |
|---|---|---|
| 0 | Success. The command did what it was asked to do. | All subcommands. |
| 1 | Test failures or a malformed input artefact. | run, report (bad input), diff (breaking change with --fail-on-breaking true), eval (failing verdict), compliance (regression vs baseline), model-compat (FAIL). |
| 2 | Configuration error or invalid arguments. | validate, init (write conflict), report (collector rejection), run (config load failed). clap also returns 2 for unknown flags. |
| 3 | --wait-for-ready budget expired before a URL server accepted connections. | run, doctor. |
| 5 | Cost cap exceeded, or run --update-snapshots refused under CI=true. | eval, run. |
| 6 | Coverage below threshold, or a model-compat DRIFT. | coverage and run --coverage-threshold, model-compat (DRIFT). |
| 7 | No tests selected. The suite is empty, or --filter, --shard, or --last-failed matched nothing, and --pass-with-no-tests was not passed. | run. |
Codes outside this set are reserved. An unexpected 2 is almost certainly clap rejecting the command line (a parse error) rather than a failure inside the subcommand.
Note: a future doctor --lint-descriptions quality lint will land its own exit code when that feature ships; it is not wired in the v1.0 binary today.
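Sketch (illustrative). A CI wrapper can branch on the exit code instead of parsing stdout. The Python below encodes the table above as a lookup and pairs it with a small runner. The meanings are condensed from this page; describe and run_and_describe are hypothetical names, and run_and_describe assumes mcptest is on PATH (it is defined but not called here).

```python
import subprocess

# Condensed from the exit-code table on this page.
EXIT_MEANINGS = {
    0: "success",
    1: "test failures or a malformed input",
    2: "configuration error or invalid arguments",
    3: "--wait-for-ready budget expired",
    5: "cost cap exceeded, or snapshot update refused under CI",
    6: "coverage below threshold, or model-compat DRIFT",
    7: "no tests selected",
}


def describe(code):
    """Map an exit code to its documented meaning."""
    return EXIT_MEANINGS.get(code, f"reserved exit code {code}")


def run_and_describe(argv):
    """Run an mcptest command and return (code, meaning)."""
    code = subprocess.run(argv, check=False).returncode
    return code, describe(code)


print(describe(7))  # no tests selected
print(describe(4))  # reserved exit code 4
```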
Cross-references
- getting-started.md: five-minute walkthrough that installs the binary, scaffolds a project, and runs the first test.
- yaml-reference.md: every field in the YAML test format, the schema, and worked examples.
- troubleshooting.md: common failure modes, what each exit code means in practice, and how to diagnose a stuck run.
- crates/mcptest/src/cli/: the source of truth for every flag on this page (cli/mod.rs for the command tree, cli/args/ for each subcommand's flags). If this doc disagrees with the source, the source wins; please file a ticket against the docs.
- schemas/v1.json: the JSON Schema emitted by mcptest schema.
- schemas/wire/v0.json: the upload envelope shape consumed by mcptest report --format upload.