CLI reference

Complete reference for every subcommand and global flag in the mcptest binary. Source of truth is crates/mcptest/src/cli/ (the Command enum in cli/mod.rs and the per-subcommand Args structs under cli/args/); this page mirrors that at the v1.0 cut. When a flag is wired to a stub handler, that is called out so a reader knows the implementation work is still pending.

For a friendlier introduction, start with getting-started.md. The YAML test format is documented in yaml-reference.md, and common failure modes are covered in troubleshooting.md.

Synopsis

mcptest [GLOBAL_OPTIONS] <SUBCOMMAND> [ARGS]

mcptest --help prints a clap-generated summary of every flag below. mcptest --version prints the build version. mcptest <SUBCOMMAND> --help prints the per-subcommand summary including any subcommand-specific flags.

Global flags are accepted before or after the subcommand name: both mcptest --debug run and mcptest run --debug parse identically.

Global options

These flags are declared on GlobalArgs in crates/mcptest/src/cli/global.rs. Every subcommand inherits them via #[command(flatten)], so they work uniformly regardless of which subcommand you call.

Output and logging

`--no-color`

Type: boolean flag
Default: off
Description: disable ANSI color output. Useful in CI logs and when piping to tools that mishandle escape sequences.
When to use it: any CI provider that captures raw stdout (GitHub Actions, GitLab CI, Buildkite) renders mcptest output more cleanly without color. Also useful when redirecting output to a file you plan to read later in a pager that does not handle ANSI.

`--debug`

Type: boolean flag
Default: off
Description: enable debug logging. Sets RUST_LOG=mcptest=debug if the variable is not already set, then initializes the tracing subscriber.
When to use it: you suspect mcptest is doing something different from what you asked. Debug logs include resolver decisions, HTTP request and response headers (redacted), and matcher dispatch.

`--verbose`

Type: boolean flag
Default: off
Description: print extra detail about variable resolution and discovered files. Lighter than --debug; aimed at people running a suite rather than at developers working on mcptest itself.
When to use it: you want to know which .env file shaped the run, or which --var overrides ended up winning. The output is sorted and stable so it diffs cleanly across runs.

Logging

mcptest emits structured log events through tracing once the binary starts. The subscriber writes to stderr (stdout is reserved for reporter output) and is filtered by an EnvFilter resolved from the sources listed under "Filter resolution precedence" below.

`--log-level <LEVEL>`

Type: string. Either a single level (off, error, warn, info, debug, trace) or a RUST_LOG-style directive like mcptest_core::cache=debug,mcptest_core::runner=trace.
Default: warn (so a passing run emits no stderr).
When to use it: you want to dial verbosity for one run without exporting an env var. Invalid directives are rejected before the run starts so a typo never silently falls back to the default.

# Trace cache decisions for one debugging session.
mcptest --log-level "mcptest_core::cache=debug" run

# Quietest possible run; exit code is the only signal.
mcptest --log-level off run

Filter resolution precedence

Highest first:

--log-level <VAL>
MCPTEST_LOG env var. Use this when a parent process sets RUST_LOG=trace for its own purposes and you do not want that flood to leak into mcptest output.
RUST_LOG env var (standard convention).
--debug (back-compat: maps to mcptest=debug,mcptest_core=debug,mcptest_config=debug).
--verbose (back-compat: maps to mcptest=info,mcptest_core=info,mcptest_config=info).
Built-in default: warn.
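
The chain above amounts to "the first source that is set wins". The Python sketch below is illustrative only and is not mcptest's implementation: `resolve_filter` and its parameters are hypothetical names, and the two environment variables are collapsed into a single pre-resolved `env_filter` argument so the sketch focuses on how the flags and the default relate to each other.

```python
from typing import Optional

# Back-compat expansions quoted from the precedence list above.
DEBUG_FILTER = "mcptest=debug,mcptest_core=debug,mcptest_config=debug"
VERBOSE_FILTER = "mcptest=info,mcptest_core=info,mcptest_config=info"
DEFAULT_FILTER = "warn"


def resolve_filter(
    log_level: Optional[str] = None,
    env_filter: Optional[str] = None,
    debug: bool = False,
    verbose: bool = False,
) -> str:
    """Return the first filter source that is set, highest precedence first."""
    if log_level:
        return log_level  # --log-level always wins
    if env_filter:
        return env_filter  # value taken from the environment
    if debug:
        return DEBUG_FILTER
    if verbose:
        return VERBOSE_FILTER
    return DEFAULT_FILTER
```

With this shape, `resolve_filter(log_level="off", debug=True)` yields `off`, which matches the rule that an explicit `--log-level` overrides the back-compat flags.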

Other logging knobs

NO_COLOR=1 or --no-color disables ANSI escapes in log output.
MCPTEST_LOG_NO_TIME=1 suppresses the RFC 3339 timestamp at the start of every log line. Useful when snapshotting stderr in tests.

What gets logged

mcptest_core::runner: run start/end, parallelism, per-test status and duration, fail-fast trips.
mcptest_core::executor: action dispatch, deferred matcher hits.
mcptest_core::connector: connect start/success per server, negotiated protocol version, shutdown errors.
mcptest_core::protocol: request method and id; timeouts.
mcptest_core::transport::stdio: child argv, pid, malformed JSON lines, child stderr.
mcptest_core::transport::streamable_http: connect URL, non-2xx responses.
mcptest_core::cache::store::fs: put/get/delete and eviction counts.
mcptest_core::cache::eligibility: per-test cache decisions.
mcptest_config::loader: YAML load + validate, schema-skip warnings.

Credentials are never logged. The connector and transport instrumentation redacts Authorization and any header name matching (?i)(token|secret|password|cookie|auth) to ***.
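
To see which header names the redaction pattern catches, the following Python sketch applies the regular expression quoted above to a header map. It is an approximation, not the Rust code in the transport layer: `redact_headers` is a hypothetical helper, and matching the pattern anywhere in the header name (`re.search`) is an assumption.

```python
import re

# Pattern quoted from the paragraph above. Searching anywhere in the
# header name is an assumption of this sketch.
SENSITIVE = re.compile(r"(?i)(token|secret|password|cookie|auth)")


def redact_headers(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of headers with sensitive values replaced by ***."""
    return {
        name: "***" if SENSITIVE.search(name) else value
        for name, value in headers.items()
    }
```

Note that `Authorization` is already covered by the `auth` alternative, so under this reading the explicit mention of it is belt and braces.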

`--quiet`

Type: boolean flag
Default: off
Description: suppress progress output (banners, per-second elapsed lines). Exit codes still communicate success and failure.
When to use it: CI logs where the only signal that matters is the final status line. Combine with --reporter junit --output <PATH> so machine-readable output still lands in the JUnit file.

`--reporter <FORMAT>`

Type: enum {pretty, minimal, json, junit, md, html, sarif, gitlab, ndjson, tap, matrix, matrix-md, quiet}
Default: pretty
Description: output format for mcptest run. Pairs with --output to pick the sink. The same formats (except minimal) are available on mcptest report --format.
When to use it:
- pretty: interactive shells and local development. Prints one line per test plus a summary, with failure detail inline.
- minimal: a compact one-line summary (ran N tests: ...) on stdout, with a FAIL line per failure on stderr. The legacy default; handy for terse CI logs.
- json: tooling that consumes the full run record. A json file is the canonical record mcptest report re-renders.
- junit: a CI test reporter (GitHub Actions, GitLab, CircleCI Insights).
- md: a Markdown summary for a PR comment or job summary.
- html: a self-contained HTML build artifact.
- sarif: GitHub Code Scanning (see sarif-reporter.md).
- gitlab: GitLab Code Quality (see gitlab-code-quality.md).
- ndjson: one JSON record per line, for log pipelines and jq -c.
- tap: Test Anything Protocol v14, for prove/tappy-style consumers.
- matrix: a self-contained HTML test-by-model comparison grid (see matrix-reporter.md). The default output of a --models sweep.
- matrix-md: the same comparison grid as GitHub-flavored Markdown.
- quiet: only the exit code matters. Equivalent to --quiet --reporter pretty but more explicit.

`--output <PATH>`

Type: filesystem path. Use - for stdout.
Default: none (the chosen format renders to stdout)
Description: sink for the --reporter format. A path writes to that file; a file write also echoes a one-line summary to stderr so the result stays visible. - is an explicit stdout.
When to use it: you want the JUnit XML in a known location, or you want to capture the JSON run record so mcptest report can re-render it later without a second run (--reporter json --output run.json).

`--annotations <WHEN>`

Type: enum {auto, always, never}
Default: auto
Description: emit GitHub Actions inline annotations (::error/::warning) to stderr alongside the normal output, one per failure. auto emits only inside Actions (GITHUB_ACTIONS=true); always forces them; never disables them. Composes on top of any --reporter format.
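
The three modes can be sketched in Python as below. This is a hedged illustration, not mcptest's code: `annotations_enabled` and `format_annotation` are hypothetical names, and the `title=` property is only one plausible layout for the `::error` workflow command. The escaping follows the GitHub Actions workflow-command rules for message data and property values.

```python
from typing import Mapping


def annotations_enabled(when: str, env: Mapping[str, str]) -> bool:
    """Resolve --annotations auto|always|never against the environment."""
    if when == "always":
        return True
    if when == "never":
        return False
    if when == "auto":
        return env.get("GITHUB_ACTIONS") == "true"
    raise ValueError(f"unknown --annotations value: {when!r}")


def _escape_data(text: str) -> str:
    # Workflow commands are line-based, so % and newlines are percent-encoded.
    return text.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def _escape_property(text: str) -> str:
    # Property values additionally encode the : and , delimiters.
    return _escape_data(text).replace(":", "%3A").replace(",", "%2C")


def format_annotation(test_name: str, message: str) -> str:
    """Build one ::error workflow command (layout is illustrative)."""
    return f"::error title={_escape_property(test_name)}::{_escape_data(message)}"
```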

`--color <WHEN>`

Type: enum {auto, always, never}
Default: auto
Description: colorize human-readable output (the pretty run/report summary and the mcptest security findings). auto colors only when the output is a terminal and the NO_COLOR environment variable (no-color.org) is unset; always forces color even when piped; never disables it. A file sink (--output) and machine formats (JSON, SARIF, JUnit, NDJSON, ...) stay plain regardless.

Configuration sources

`--config <PATH>`

Type: filesystem path
Default: ./mcptest.yaml if present
Description: path to the mcptest YAML config. The file is parsed and validated against the JSON Schema before any test runs.
When to use it: your test suite lives outside the repo root (a subdirectory, a separate config repo, a generated file in /tmp). The path may be absolute or relative to the current working directory.

`--env-file <PATH>`

Type: filesystem path
Default: none
Description: environment file to load before running. Lines look like KEY=VALUE; quoted values, comments, and blank lines are tolerated. Loaded in addition to any .env, .env.local, or .env.test files discovered in the working directory.
When to use it: secrets live in a file named something other than .env (a .env.staging, a secrets.env from a vault dump). The flag is not repeatable in v1.0; pass one explicit env file at a time.
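
The file format described above (KEY=VALUE lines, with quoted values, comments, and blank lines tolerated) can be approximated with the Python sketch below. It is a minimal reading of that description, not mcptest's parser: `parse_env_file` is a hypothetical name, and silently skipping malformed lines is an assumption.

```python
def parse_env_file(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        key = key.strip()
        if not sep or not key:
            continue  # Assumption: malformed lines are skipped, not fatal.
        value = value.strip()
        # Strip one pair of matching single or double quotes.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key] = value
    return values
```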

`--no-env-file`

Type: boolean flag
Default: off
Description: skip auto-discovery of .env, .env.local, and .env.test.
When to use it: CI where the runner injects every variable through real environment variables and a stray .env checked into the repo should not change behavior. Also useful when reproducing a failing CI run locally and you want to make sure your local .env is not silently winning.

`--var KEY=VALUE`

Type: repeatable key-value pair
Default: empty
Description: override a variable. The key must be non-empty; the value may be empty (so --var FOO= clears an inherited value).
When to use it: a one-off run with a different model name, base URL, or feature flag. --var has the highest precedence: it beats --env-file, auto-discovered dotenvs, and the process environment.
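
A short Python sketch of the KEY=VALUE rule and the "highest precedence" behavior follows. The names `parse_var` and `apply_overrides` are hypothetical, and splitting on the first `=` is an assumption (it lets values such as URLs contain further `=` characters).

```python
def parse_var(arg: str) -> tuple[str, str]:
    """Split a --var argument; the key must be non-empty, the value may be empty."""
    key, sep, value = arg.partition("=")
    if not sep or not key:
        raise ValueError(f"expected KEY=VALUE, got {arg!r}")
    return key, value


def apply_overrides(resolved: dict[str, str], var_args: list[str]) -> dict[str, str]:
    """Layer --var overrides on top of variables resolved from other sources."""
    merged = dict(resolved)
    for arg in var_args:
        key, value = parse_var(arg)
        merged[key] = value  # --var wins over whatever was there
    return merged
```

Passing `FOO=` stores an empty string, which is how an inherited value gets cleared.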

`--show-secrets`

Type: boolean flag
Default: off (values render as ***)
Description: print resolved variable values verbatim instead of redacting them.
When to use it: only when you are debugging on your own machine, not in a shared terminal, and not in CI. Anything the resolver sees becomes visible: bearer tokens, API keys, internal URLs. Treat this like set -x for secrets.

Test selection and execution

`--filter <EXPR>`

Type: string expression
Default: none (run every test)
Description: filter expression. Only tests whose name or tag matches the expression run. Substring match in v1.0; richer query syntax is planned.
When to use it: you are iterating on a single test or a tag like @smoke and do not want to wait for the full suite.

`--parallel <N>`

Type: integer (0 means auto)
Default: auto (typically the CPU core count)
Description: maximum number of tests to run in parallel. 0 defers to the runner; any positive integer pins the cap.
When to use it: your MCP server has rate limits and you want to keep concurrency low, or you are debugging a flaky test and want --parallel 1 to remove interleaving.

`--timeout <SECONDS>`

Type: integer seconds
Default: value from the YAML config, or the runner default
Description: per-test timeout, in seconds. Overrides any value in the config. Accepts whole numbers only.
When to use it: a particular run has a slow target (a cold-started Lambda, a remote dev machine), or you want a strict timeout in CI to avoid runaway jobs.

`--retry <N>`

Type: integer
Default: 0
Description: retry each failing test up to N times before counting it as failed. Retries are not meant to paper over real bugs; they smooth out network blips against flaky third-party services.
When to use it: flake-prone integrations that you cannot improve directly. Always file a follow-up to fix the underlying flake; --retry is a CI patch, not a fix.

`--watch`

Type: boolean flag
Default: off
Description: re-run on file changes (watch mode). Stub in v1.0; live wiring is still pending.
When to use it: interactive development. Until the wiring lands the flag parses but does not yet block on file events.

`--wait-for-ready[=DURATION]`

Type: optional duration with units. Accepts a bare integer (seconds), or Ns, Nm, Nh. Examples: 30, 30s, 2m, 1h.
Default: not set. When passed without =..., defaults to 60s.
Description: before the run connects, poll every URL server until it accepts a TCP connection, or fail once the configured budget is spent. The driver uses exponential backoff (250ms, 500ms, 1s, 2s, then 5s steady). On mcptest run the budget is shared across all URL servers; on mcptest doctor it polls the --url target. If a server never comes up, mcptest exits with code 3. stdio servers are spawned by mcptest itself, so they are skipped (there is nothing to wait for). Once the listener is up the normal connect proceeds, so a 401 or protocol error surfaces through the run's own fast-fail rather than the readiness loop.
When to use it: preview-deploy CI where the target spins up alongside the test job and is not immediately reachable. --wait-for-ready=60s is a sane default; bump it for cold-started containers.
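
The duration syntax and the polling schedule can be sketched in Python as follows. This is an illustration under stated assumptions rather than the actual driver: all function names are hypothetical, and the one-second per-attempt connect timeout is invented for the example.

```python
import itertools
import re
import socket
import time
from typing import Iterator

_UNITS = {"": 1, "s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert 30, 30s, 2m, or 1h into whole seconds."""
    match = re.fullmatch(r"(\d+)([smh]?)", text.strip())
    if match is None:
        raise ValueError(f"invalid duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def backoff_delays() -> Iterator[float]:
    """Yield 250ms, 500ms, 1s, 2s, then 5s forever."""
    yield from (0.25, 0.5, 1.0, 2.0)
    yield from itertools.repeat(5.0)


def wait_for_tcp(host: str, port: int, budget_seconds: float) -> bool:
    """Poll until host:port accepts a TCP connection or the budget runs out."""
    deadline = time.monotonic() + budget_seconds
    for delay in backoff_delays():
        try:
            # Assumption: each attempt gets a 1s connect timeout.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            pass
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(delay, remaining))
    return False  # Not reached: backoff_delays() never ends.
```

A caller mirroring the documented behavior would exit with code 3 when `wait_for_tcp` returns `False`.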

Server target overrides

These flags let you change the server: block in the YAML suite at run time. They are useful for preview deploys and CI matrices where the YAML is authored without knowing the target URL.

`--server-url <URL>`

Type: URL string
Default: none (use the in-suite server: block)
Description: override server.url for every server in the YAML suite. Mutually exclusive with --server-command. Wins over the same field in --server-config.
When to use it: preview deploy on a PR-specific URL, or running the same YAML against staging and production in two consecutive CI steps.

`--server-command <CMD>`

Type: shell-quoted command string
Default: none
Description: override server.command for every server in the YAML suite. The argument is split using POSIX shell rules (the shell-words crate), so --server-command "./dev-server --debug" parses to ["./dev-server", "--debug"]. Mutually exclusive with --server-url. Wins over the same field in --server-config.
When to use it: you want to point a YAML written for a remote URL at a locally-built binary instead, without editing the YAML.
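
Python's standard `shlex` module follows the same POSIX splitting rules, so it gives a quick way to preview how a command string will be tokenized. This is an analogy to the shell-words crate, not the crate itself; `split_server_command` is a hypothetical helper and the empty-command check is an assumption.

```python
import shlex


def split_server_command(command: str) -> list[str]:
    """Split a --server-command string into argv using POSIX shell rules."""
    argv = shlex.split(command, posix=True)
    if not argv:
        # Assumption: an empty command is treated as an error.
        raise ValueError("--server-command must not be empty")
    return argv
```

Quoted segments stay together, so `node server.js --name "my server"` becomes four arguments rather than five.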

`--server-auth-bearer-env <NAME>`

Type: env variable name
Default: none
Description: set server.auth.bearer_token_env for URL targets. The runner reads NAME from the environment at connect time and sends the value as Authorization: Bearer <value>. Wins over the same field in --server-config.
When to use it: the YAML hard-codes the bearer env name for production (PROD_BEARER_TOKEN) and a preview environment uses a different secret name (PREVIEW_BEARER_TOKEN).

`--server-config <PATH>`

Type: filesystem path
Default: none
Description: load a YAML file containing a full server: block and use it in place of the in-suite server block. Lower precedence than the single-field flags above: when a field appears in both --server-config and a flag like --server-url, the explicit flag wins.
When to use it: you maintain per-environment server snippets (server.prod.yaml, server.staging.yaml) and want to swap them in without editing the main suite.
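
The layering of the four override flags can be sketched in Python as below. It is a simplified model for a single server block, not the real resolver: `resolve_server` and the plain-dict representation are hypothetical, and dropping the opposite transport key when `url` or `command` is overridden is an assumption.

```python
from typing import Optional


def resolve_server(
    suite_server: dict,
    server_config: Optional[dict] = None,
    server_url: Optional[str] = None,
    server_command: Optional[list[str]] = None,
    auth_bearer_env: Optional[str] = None,
) -> dict:
    """Apply --server-config first, then the single-field flags on top."""
    if server_url is not None and server_command is not None:
        raise ValueError("--server-url and --server-command are mutually exclusive")
    # --server-config replaces the in-suite block wholesale.
    server = dict(server_config if server_config is not None else suite_server)
    if server_url is not None:
        server["url"] = server_url
        server.pop("command", None)  # Assumption: one transport per server.
    if server_command is not None:
        server["command"] = server_command
        server.pop("url", None)  # Assumption: one transport per server.
    if auth_bearer_env is not None:
        auth = dict(server.get("auth", {}))
        auth["bearer_token_env"] = auth_bearer_env
        server["auth"] = auth
    return server
```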

HTTP transport

`--header NAME=VALUE`

Type: repeatable key-value pair
Default: empty
Description: add a literal HTTP header. Authorization and Proxy-Authorization are rejected; credentials must live in auth:. The value is sent verbatim.
When to use it: a custom header expected by the server (X-Tenant-ID, X-Request-Source). Combine with --header-env when the value should come from an env var.

`--header-env NAME=VAR_NAME`

Type: repeatable header-name to env-var-name pair
Default: empty
Description: add an env-backed HTTP header. The runner reads VAR_NAME from the environment at connect time and uses its value as the header value. Both names must be non-empty.
When to use it: secret-flavored custom headers (multi-tenant SaaS keys, internal trace tokens) you do not want to put on the command line.
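
How the two header flags combine can be sketched in Python as follows. The helper names are hypothetical, the case-insensitive comparison of forbidden names is an assumption, and so is raising an error when the referenced environment variable is missing.

```python
from typing import Mapping

FORBIDDEN = {"authorization", "proxy-authorization"}


def parse_pair(arg: str) -> tuple[str, str]:
    """Split NAME=VALUE on the first '='; the name must be non-empty."""
    name, sep, value = arg.partition("=")
    if not sep or not name:
        raise ValueError(f"expected NAME=VALUE, got {arg!r}")
    return name, value


def build_headers(
    header_args: list[str],
    header_env_args: list[str],
    env: Mapping[str, str],
) -> dict[str, str]:
    """Merge literal --header values with env-backed --header-env values."""
    headers: dict[str, str] = {}
    for arg in header_args:
        name, value = parse_pair(arg)
        if name.lower() in FORBIDDEN:
            raise ValueError(f"{name} must be configured under auth:, not --header")
        headers[name] = value  # sent verbatim
    for arg in header_env_args:
        name, var_name = parse_pair(arg)
        if not var_name:
            raise ValueError(f"empty env var name in {arg!r}")
        if var_name not in env:
            # Assumption: a missing variable is an error, not an empty header.
            raise KeyError(var_name)
        headers[name] = env[var_name]
    return headers
```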

`--insecure-skip-verify`

Type: boolean flag
Default: off
Description: disable TLS certificate verification. Dangerous; only use against a private staging endpoint with a self-signed certificate. The runner prints a banner whenever this flag is set so the operator can spot a misconfigured CI job.
When to use it: local staging environments behind a self-signed cert. Do not use against production endpoints, ever.

`--ca-bundle <PATH>`

Type: filesystem path to a PEM-encoded CA bundle
Default: system trust store
Description: path to a PEM-encoded CA bundle for HTTP transport.
When to use it: your organization runs an internal CA whose root is not in the system trust store. Preferable to --insecure-skip-verify.

`--http-timeout <SECONDS>`

Type: integer seconds
Default: value from server.http.timeout, or the runner default
Description: HTTP per-request timeout, in seconds. Overrides server.http.timeout.
When to use it: a slow upstream needs longer per-request budgets, or you want a strict bound for CI fairness.

`--connect-timeout <SECONDS>`

Type: integer seconds
Default: value from server.http.connect_timeout, or the runner default
Description: HTTP connect timeout, in seconds. Overrides server.http.connect_timeout.
When to use it: behind a flaky NAT or VPN; a tighter connect budget will trip faster than waiting for the per-request timeout.

Proxy

Proxy flags apply to every outbound HTTP client mcptest builds: the StreamableHTTP and legacy SSE transports for MCP servers, plus the LLM provider clients (Anthropic, OpenAI, Google, Mistral, and any custom OpenAI-compat provider declared under providers:).

When no flag is set, reqwest reads HTTP_PROXY, HTTPS_PROXY, and NO_PROXY from the environment, so users behind a corporate proxy who already export those variables get the right behavior without changing anything. Use the flags below to override or disable.

`--proxy <URL>`

Type: URL string
Default: unset; reqwest reads HTTP_PROXY / HTTPS_PROXY from the environment
Description: catch-all proxy for both HTTP and HTTPS targets.
When to use it: a corporate proxy that handles both schemes.

`--http-proxy <URL>`

Type: URL string
Default: unset
Description: HTTP-only proxy. Wins over --proxy for plain-HTTP targets.
When to use it: separate proxies per scheme.

`--https-proxy <URL>`

Type: URL string
Default: unset
Description: HTTPS-only proxy. Wins over --proxy for HTTPS targets.
When to use it: most corporate setups route only HTTPS through a TLS-terminating CONNECT proxy.

`--no-proxy`

Type: boolean flag
Default: off
Description: disable every proxy, including reqwest's automatic env-var pickup. Mutually exclusive with --proxy, --http-proxy, and --https-proxy.
When to use it: the shell has a system-wide HTTPS_PROXY set but this single run should go direct.

`--noproxy <HOSTLIST>`

Type: comma-separated hostnames or domain patterns
Default: empty
Description: hosts to bypass the proxy for. Each entry is an exact hostname or a leading-dot suffix (.internal.example). Mirrors NO_PROXY semantics.
When to use it: route everything through the corporate proxy except your localhost test server or internal staging domain.

Verify what is in effect with mcptest doctor (prints a one-line proxy: summary) or mcptest run --print-config (includes the same summary plus the resolved test list).
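
The interaction between the proxy flags can be summarized with the Python sketch below. It models only the flags, not reqwest's environment pickup, and is not the actual client-building code: `bypasses_proxy` and `pick_proxy` are hypothetical names, and treating a leading-dot entry as matching subdomains only (not the bare domain) is an assumption.

```python
from typing import Optional


def bypasses_proxy(host: str, noproxy: str) -> bool:
    """Check a host against a comma-separated --noproxy list."""
    host = host.lower()
    for entry in (item.strip().lower() for item in noproxy.split(",")):
        if not entry:
            continue
        if entry.startswith("."):
            if host.endswith(entry):  # leading-dot suffix match
                return True
        elif host == entry:  # exact hostname match
            return True
    return False


def pick_proxy(
    scheme: str,
    host: str,
    *,
    proxy: Optional[str] = None,
    http_proxy: Optional[str] = None,
    https_proxy: Optional[str] = None,
    no_proxy: bool = False,
    noproxy: str = "",
) -> Optional[str]:
    """Return the proxy URL to use for one target, or None to go direct."""
    if no_proxy and (proxy or http_proxy or https_proxy):
        raise ValueError("--no-proxy is mutually exclusive with the proxy URL flags")
    if no_proxy or bypasses_proxy(host, noproxy):
        return None
    # The scheme-specific flag wins over the catch-all --proxy.
    specific = http_proxy if scheme == "http" else https_proxy
    return specific or proxy
```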

Upload reporter

These flags are read by mcptest report --format upload. They parse on every command but have no effect outside that path.

`--upload-endpoint <URL>`

Type: URL string
Default: none
Description: endpoint to POST the canonical run envelope to. The body shape is documented in schemas/wire/v0.json.
When to use it: shipping run results to a collector (an mcptest-cloud instance, an internal observability service).

`--upload-token-env <NAME>`

Type: env variable name
Default: none
Description: name of the env var that holds the bearer token for the upload reporter. Read at config-build time and sent verbatim as Authorization: Bearer <value>.
When to use it: the collector requires authentication. Keep the token out of the command line and out of CI logs by referencing it through an env var.

`--upload-organization <NAME>`

Type: string
Default: none
Description: optional organization tag included in the upload envelope.
When to use it: multi-tenant collectors that need to attribute runs to a specific organization. The collector treats this as an opaque label.

Subcommands

`run`

mcptest [GLOBAL_OPTIONS] run

Description. Run the test suite. This is the primary command and the one you will type most often.

The runner loads the YAML config (default ./mcptest.yaml or whatever --config points to), resolves variables, applies any server-target overrides, and executes each test against the configured server. Results are printed by the reporter selected with --reporter (default pretty).

Arguments.

Argument	Type	Required	Description
`--update-snapshots`, `-u`	flag	optional	Rewrite every snapshot fixture encountered during the run.
`--allow-update-in-ci`	flag	optional	Permit `--update-snapshots` even when `CI=true` is set.
`--models <ID,ID,...>`	list	optional	Run the suite as a model matrix: every agent test fans across this comma-separated list (one cell per model), and the run defaults to the `matrix` reporter. See matrix-reporter.md.
`--no-verdict-cache`	flag	optional	Disable the LLM-judge verdict cache for this run. Overrides `evals.cache.verdicts: true` in YAML.
`--coverage`	flag	optional	Record per-tool, per-argument, per-error-path, and per-transport coverage during the run. Folds the result into JSON, pretty, markdown, and HTML reporters.
`--coverage-threshold <SPEC>`	string	optional	Quality gate against the coverage report. Accepts `tools=80,args=60,error_paths=50,transports=100`. Exits with code `6` when any dimension falls below its threshold. Requires `--coverage`.
`--no-cache`	flag	optional	Bypass the content-addressed cache for this run. Equivalent to passing both `--no-cache-read` and `--no-cache-write`.
`--no-cache-read`	flag	optional	Write fresh entries but ignore existing ones. Equivalent to "refresh the cache on this run."
`--no-cache-write`	flag	optional	Read existing entries but do not update the cache. Equivalent to "freeze the cache for this run."
`--cache-filter <SET>`	string	optional	Restrict execution to tests matching the named cache set. `NEW` is the only value v1 ships.
`--record`	flag	optional	For agent tests, dispatch every model live and overwrite the cassette on disk. Default behavior replays the cassette when present.
`--bail`, `-x`	flag	optional	Stop the runner after the first failing test. Subsequent tests are reported as skipped.
`--maxfail <N>`	integer	optional	Stop after the Nth failing test. Implies `--bail` when `N=1`.
`--collect-only`, `--list-tests`	flag	optional	Print discovered tests and exit `0` without running them. Honors `--filter`. Useful for verifying a selector before a long run.
`--pass-with-no-tests`	flag	optional	Treat "zero tests selected" as success. Without this flag, a run that picks nothing (after `--filter`, `--shard`, or `--last-failed`) exits `7`.
`--shard <INDEX/TOTAL>`	string	optional	Run a deterministic slice of the discovered tests. One-based; `--shard 1/3` runs the first third. The partition is stable across runs. Pair with `--pass-with-no-tests` so a worker with an empty slice does not fail the CI matrix.
`--last-failed`, `--lf`	flag	optional	Run only the tests that failed on the previous invocation. Reads `.mcptest/last-run.json` (rewritten after every run).
`--failed-first`, `--ff`	flag	optional	Reorder the test list so previous failures run first. Every selected test still runs. Mutually exclusive with `--last-failed`.
`--print-config`	flag	optional	Print the resolved suite (servers, providers, budget, selected tests) and exit. Provider API key env vars are listed by name, never resolved values.
`--tag <NAME>`	string (repeatable)	optional	Run only tests whose YAML `tags:` list contains NAME. Multiple `--tag` flags are OR'd.
`--skip-tag <NAME>`	string (repeatable)	optional	Drop tests whose `tags:` list contains NAME. Applied after `--tag` so a test matching both is dropped.
`--random`	flag	optional	Shuffle the test order to surface hidden ordering dependencies. The seed is logged so you can reproduce with `--seed N`. Conflicts with `--failed-first`.
`--seed <N>`	integer	optional	Pin the shuffle seed for `--random`. Implies `--random`.
`--changed`	flag	optional	Run only tests whose YAML or referenced cassette changed in the git working tree against `--changed-base` (default `origin/main`). Outside a git repo the selection is empty, so pair with `--pass-with-no-tests`.
`--changed-base <REF>`	git ref	optional	Base ref for `--changed`. Default `origin/main`.
`-- <SERVER_ARG>...`	trailing args	optional	Any argument after `--` is appended to every stdio server's command line. Ignored for HTTP / SSE servers.
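
As an illustration of the `--coverage-threshold` gate in the table above, the Python sketch below parses a SPEC string and maps a coverage report to the documented exit code. It is a reading of the table, not the real gate: the function names are hypothetical, and it assumes coverage values are percentages keyed by dimension name.

```python
DIMENSIONS = ("tools", "args", "error_paths", "transports")


def parse_threshold(spec: str) -> dict[str, float]:
    """Parse a SPEC such as tools=80,args=60 into per-dimension minimums."""
    thresholds: dict[str, float] = {}
    for part in spec.split(","):
        name, sep, value = part.strip().partition("=")
        if not sep or name not in DIMENSIONS:
            raise ValueError(f"invalid threshold entry: {part!r}")
        thresholds[name] = float(value)
    return thresholds


def coverage_exit_code(coverage: dict[str, float], spec: str) -> int:
    """Return 6 when any gated dimension is below its threshold, else 0."""
    for name, minimum in parse_threshold(spec).items():
        # Assumption: a dimension missing from the report counts as 0 percent.
        if coverage.get(name, 0.0) < minimum:
            return 6
    return 0
```

Only dimensions named in the SPEC are gated, so `tools=80,args=60` leaves `error_paths` and `transports` unchecked in this sketch.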

Agent-test env vars. Auto-detected provider families read these at run time. Missing keys are not an error; the runner falls back to a deterministic stub so CI stays green.

Family	Env var	Notes
Anthropic	`ANTHROPIC_API_KEY`	Claude models (`claude-*`).
OpenAI	`OPENAI_API_KEY` plus optional `OPENAI_ORG_ID`	`gpt-`, `chatgpt-`, `o<digit>`, `text-`, `davinci-*`.
Google	`GEMINI_API_KEY` (falls back to `GOOGLE_API_KEY`)	`gemini-*`.
Mistral	`MISTRAL_API_KEY`	`mistral-`, `codestral-`, etc.

Custom OpenAI-compatible endpoints (Azure, OpenRouter, vLLM, LiteLLM, Together, Groq, Anyscale, Fireworks) are declared under top-level providers: in the YAML and reference whatever env var name you give them. See docs/models.md.

Examples.

# Smallest possible invocation; uses ./mcptest.yaml.
mcptest run

# Run a specific suite with JUnit output for CI.
mcptest --config tests/mcp.yaml --reporter junit --output reports/junit.xml run

# Point the suite at a preview deploy and wait for readiness.
mcptest --server-url https://preview-42.example.com \
        --wait-for-ready=2m \
        run

# Iterate on a single tag locally with debug logging.
mcptest --debug --filter '@smoke' run

# Record coverage and gate the run on a two-dimension threshold.
mcptest run --coverage --coverage-threshold tools=80,args=60

# Record agent cassettes against every key set in the environment.
ANTHROPIC_API_KEY=... OPENAI_API_KEY=... mcptest run --record

# Show which tests the runner would execute without running them.
mcptest --filter weather run --collect-only

# Stop at the first failure (developer's inner loop).
mcptest run --bail

# Iterate on yesterday's failures only; bail on the first one.
mcptest run --last-failed --bail

# CI matrix: three workers each take a stable third of the suite.
mcptest run --shard 1/3 --pass-with-no-tests

# See what the runner would dispatch without connecting to anything.
mcptest run --print-config

# Gate the run on a saved timing baseline.
mcptest run --check-baseline mcptest.timing-baseline.yml --tolerance-pct 25

Timing baseline (--check-baseline). Compares the run against a saved baseline file (written by mcptest baseline update). Any test whose elapsed milliseconds exceed p90 * (1 + tolerance_pct / 100) prints a regression line, and the run exits with code 1. The default tolerance (--tolerance-pct) is 0, so any overrun is a regression; a busy CI runner may want 25 or 50 to absorb noise. Cache hits and skipped tests are excluded from the check.
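
The regression rule is a single comparison, shown here as a Python sketch. The row and baseline field names (`name`, `elapsed_ms`, `skipped`, `cache_hit`, `p90`) are hypothetical, and ignoring tests that have no baseline entry is an assumption.

```python
def is_regression(elapsed_ms: float, p90_ms: float, tolerance_pct: float = 0.0) -> bool:
    """True when elapsed time exceeds p90 * (1 + tolerance_pct / 100)."""
    return elapsed_ms > p90_ms * (1 + tolerance_pct / 100)


def find_regressions(
    rows: list[dict],
    baseline: dict[str, dict],
    tolerance_pct: float = 0.0,
) -> list[str]:
    """Return the names of tests that overran their baseline p90."""
    regressed = []
    for row in rows:
        if row.get("skipped") or row.get("cache_hit"):
            continue  # excluded from the check
        entry = baseline.get(row["name"])
        if entry is None:
            continue  # Assumption: tests without a baseline are not gated.
        if is_regression(row["elapsed_ms"], entry["p90"], tolerance_pct):
            regressed.append(row["name"])
    return regressed
```

With a p90 of 100 ms and a tolerance of 25, a 125 ms run passes and a 126 ms run is flagged, because the comparison is strict.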

Exit codes. 0 on success, 1 on any test failure or baseline regression, 2 on configuration error, 3 on --wait-for-ready timeout, 6 on --coverage-threshold miss, 7 on "no tests selected" (use --pass-with-no-tests to treat as success), 124 on a hard runner timeout.
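
For CI wrappers that need to branch on the result, the exit codes listed above can be captured in a small lookup table. The Python sketch below is a convenience for scripts and is not shipped with mcptest; `describe_exit` is a hypothetical helper.

```python
# Exit codes of `mcptest run`, as documented above.
EXIT_CODES = {
    0: "success",
    1: "test failure or baseline regression",
    2: "configuration error",
    3: "--wait-for-ready timeout",
    6: "--coverage-threshold miss",
    7: "no tests selected",
    124: "hard runner timeout",
}


def describe_exit(code: int) -> str:
    """Translate an exit code into a human-readable reason."""
    return EXIT_CODES.get(code, f"unexpected exit code {code}")
```

A wrapper could obtain the code with `subprocess.run(["mcptest", "run"]).returncode` and pass it to `describe_exit` before deciding whether to retry or fail the job.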

`baseline`

Manage the timing-baseline file the mcptest run --check-baseline gate reads. v1 ships one subcommand:

mcptest baseline update [--samples 20] --from-report run.json BASELINE

update reads a saved canonical JSON run report (mcptest run --reporter json --output run.json produces one) and folds each non-skipped, non-cache-hit test row into the baseline file. When the baseline file does not exist, it is bootstrapped from the report; when it does, the per-test p50 / p90 are blended via a rolling window (--samples, default 20). Tests present in the baseline but absent from this run are preserved so a transient --tag filter does not silently delete them.

Workflow.

# Refresh the baseline from a clean run.
mcptest run --reporter json --output run.json
mcptest baseline update --from-report run.json mcptest.timing-baseline.yml

# Gate subsequent runs against the saved baseline.
mcptest run --check-baseline mcptest.timing-baseline.yml --tolerance-pct 25

`conformance`

Score a running MCP server against the vendored SEP corpus, refresh the corpus from upstream, or list which check ids the corpus carries. Three subcommands:

mcptest conformance run [FLAGS]          # score against the corpus
mcptest conformance refresh [FLAGS]      # pull the latest SEPs
mcptest conformance check-ids [FLAGS]    # list check ids

The corpus ships baked into the binary at compile time, so a cargo install mcptest user can score offline. The resolution order for the corpus directory is: --corpus-dir if set, else the XDG user cache (~/.cache/mcptest/conformance/), else the embedded fallback. Each subcommand surfaces corpus_source on its report so a reader can tell which path the run used.
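
The resolution order reads as a three-step fallback, sketched here in Python. The function name and the source labels it returns are hypothetical (the actual `corpus_source` values are not documented here), and treating the user cache as usable only when the directory exists is an assumption.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_corpus(
    corpus_dir: Optional[Path],
    cache_dir: Path,
) -> Tuple[str, Optional[Path]]:
    """Pick the corpus location: explicit flag, then user cache, then embedded."""
    if corpus_dir is not None:
        return "corpus_dir", corpus_dir
    if cache_dir.is_dir():
        # Assumption: the cache counts only when it exists on disk.
        return "user_cache", cache_dir
    return "embedded", None  # compiled-in fallback, no filesystem path
```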

conformance run flags.

Flag	Default	Purpose
`--server <URL>`	none	MCP server to probe. Wire-probe integration ships as a follow-up; the v1 report scores from the corpus content only.
`--target-version <V>`	latest available	Which spec revision to score. Defaults to the lexicographically greatest version available locally.
`--corpus-dir <PATH>`	(resolution order)	Override the corpus location.
`--format <FORMAT>`	`pretty`	One of `pretty`, `json`, `markdown`, `html`.
`--out <PATH>`	stdout	Where to write the report.
`--auto-refresh`	off	Trigger `refresh` when the requested version is missing locally. Off by default so a `run` never silently makes a network call.

conformance refresh flags.

Flag	Default	Purpose
`--target-version <V>`	`latest`	Spec version to fetch. `latest` resolves to the newest entry in `mcptest_core::conformance::releases::KNOWN_RELEASES`.
`--corpus-dir <PATH>`	user cache	Destination. Never special-cases the in-repo `crates/mcptest-core/seps/` path; maintainers pass it explicitly when refreshing the vendored copy.
`--url <URL>`	upstream	Override the upstream repository.
`--ref <REF>`	(from `KNOWN_RELEASES`)	Pin to a specific tag or SHA.
`--source-path <PATH>`	`src/seps`	Subdirectory in the upstream tree to mirror.
`--dry-run`	off	Print what would be fetched and where, without writing.

The refresh transport is an HTTPS GET to codeload.github.com/<owner>/<repo>/tar.gz/<ref>, extracted in memory. Set GITHUB_TOKEN to lift GitHub's 60-req/hr anonymous rate limit.

conformance check-ids flags.

Flag	Default	Purpose
`--target-version <V>`	latest available	Which corpus to inspect.
`--corpus-dir <PATH>`	(resolution order)	Override the corpus location.
`--missing-only`	off	Print only the unimplemented check ids.
`--format <FORMAT>`	`pretty`	One of `pretty` or `json`.

# Score the embedded corpus and write the JSON envelope.
mcptest conformance run --format json --out report.json

# Refresh the user cache from upstream (no token needed for the
# default rate-limited path).
mcptest conformance refresh

# List the check ids the runner has not implemented yet.
mcptest conformance check-ids --missing-only

`init`

mcptest [GLOBAL_OPTIONS] init [--with-jury] [--force] [--url URL | --from-discovered NAME]

Description. Scaffold a new mcptest project in the current directory. Creates a starter tests/example.yaml and an mcptest.yaml config. Safe to run inside an empty directory; refuses to overwrite existing files unless --force is supplied. --url switches to a URL-target template; --from-discovered scaffolds from a server found in a local MCP client config (see doctor).

Arguments.

Argument	Type	Required	Description
`--with-jury`	flag	optional	Append a v1.0 LLM-judge example to `tests/example.yaml`. The block is marked as a v1.0 feature in a comment so users understand it is forward-looking.
`--force`	flag	optional	Overwrite existing files. Default behavior is to refuse.
`--from-discovered <NAME>`	string	optional	Scaffold a stdio suite from a server discovered in a local MCP client config (the names `mcptest doctor` lists). Conflicts with `--url`. Env-var names are surfaced as a comment; their values are never copied into the scaffold.

Examples.

# Scaffold a project in the current directory.
mkdir my-mcp-tests && cd my-mcp-tests
mcptest init

# Include the v1.0 jury example.
mcptest init --with-jury

# Scaffold from a server already configured in a local MCP client.
mcptest init --from-discovered github

# Overwrite stale scaffolding.
mcptest init --force

Exit codes. 0 on success, 2 when a target file already exists and --force was not supplied.

Status. Working.

`doctor`

mcptest [GLOBAL_OPTIONS] doctor [--no-tool-tokens] [--tokenizer NAME]

Description. Diagnose the local environment and server connectivity. Lists which dotenv files were discovered, how many variables came from each source, and (when wired) the cost of the server's tools/list catalog measured in tokens. It also prints a test-readiness inventory of the MCP servers found across local client configs (Claude Desktop, Claude Code, Cursor, VS Code, Windsurf, Codex), showing identity, transport, and presence-of-auth only, with secrets redacted (see server discovery).

Alongside the token total, doctor reports a tool-search posture signal: friendly or heavy. A catalog is friendly when its token cost is at or under 20K and it advertises ten tools or fewer; otherwise it is heavy. The threshold sits well under the roughly 55K-token cost of a five-server MCP setup that Anthropic's advanced-tool-use guidance describes, where real systems reach about 134K and the Tool Search Tool defers definitions to cut catalog token cost by about 85 to 95 percent. A friendly catalog is cheap enough to load up front; a heavy catalog is large enough that deferred loading (tool search) would pay off.
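The posture rule above can be sketched as a small shell function. This is an illustration of the documented thresholds, not the handler's code; the name `classify_posture` is hypothetical, and reading "20K" as exactly 20000 tokens is an assumption.

```shell
# Usage: classify_posture <catalog_tokens> <tool_count>
# Prints "friendly" when both limits hold, otherwise "heavy".
classify_posture() {
  # Assumption: "20K" means 20000 tokens; both bounds are inclusive.
  if [ "$1" -le 20000 ] && [ "$2" -le 10 ]; then
    echo friendly
  else
    echo heavy
  fi
}
```

Both conditions must hold: a small catalog with eleven tools is still heavy, as is a ten-tool catalog whose definitions exceed the token limit.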

The pure computations behind doctor (env discovery, tokenizer accounting, posture classification) are fully unit-tested. The live tools/list probe is still pending.

Arguments.

Argument	Type	Required	Description
`--no-tool-tokens`	flag	optional	Disable the tool catalog token cost check. Use when the server is unreachable and you want the rest of the doctor report.
`--tokenizer <NAME>`	string	optional (default `cl100k_base`)	Override the tokenizer used for the catalog token cost check. Supported: `cl100k_base` (GPT-3.5/4), `o200k_base` (GPT-4o), `gpt2`, `claude` (currently aliased to `cl100k_base`), and `whitespace` (transport-free approximation).
`--lint-descriptions`	flag	in-flight	Run the catalog description quality lint. Not present in the v1.0 binary; it will get its own exit code when the upcoming release ships.

Examples.

# Default doctor run.
mcptest doctor

# Skip the live tools/list probe (offline triage).
mcptest doctor --no-tool-tokens

# Use a specific tokenizer for the catalog cost report.
mcptest doctor --tokenizer o200k_base

# Combine with --debug to also see resolver decisions.
mcptest --debug doctor

Exit codes. 0 when the report renders successfully. 1 is reserved for a doctor probe that fails outright. 7 is reserved for --lint-descriptions failures.

Migration probe (--target-version). Pair --url with --target-version 2026-07-28 to run the migration doctor. It adds a one-shot initialize probe after the regular pipeline and reports one row per breaking change from the migration pair-corpus. v1 detects the deprecated capabilities (Roots / Sampling / Logging); other categories surface as [SKIP] with a one-line rationale and a follow-up ticket reference (stateless transport, schema validator, auth pack). Pair with mcptest lint for the offline YAML and cassette scan. A [FAIL] row gates CI (exit 1).

mcptest doctor --url https://mcp.example.com --target-version 2026-07-28

Status. Working. The tool catalog token check is wired but the live tools/list call is still pending; until then the handler reports that the check is wired but not yet runnable. The tool-search posture signal is computed from the same catalog token cost. --lint-descriptions is in-flight.

`validate`

mcptest [GLOBAL_OPTIONS] validate

Description. Validate the YAML config against the published JSON Schema (schemas/v1.json). Useful as a pre-commit hook and as the first step in any CI pipeline: it catches typos and structural mistakes before any test runs.

The path to validate is taken from the global --config flag so behavior is consistent with run.

Arguments.

Argument	Type	Required	Description
(none)			The file to validate is taken from `--config` or `./mcptest.yaml`.

Examples.

# Validate the default ./mcptest.yaml.
mcptest validate

# Validate a specific suite (useful in a multi-suite repo).
mcptest --config tests/integration/mcp.yaml validate

# Run validate as a pre-commit step.
git diff --cached --name-only | grep -Eq '\.ya?ml$' && mcptest validate

Exit codes. 0 when the file parses and validates. 2 on a schema violation, broken ${VAR} reference, missing import, or unreadable file. Every finding from every layer is reported in a single pass.

Status. Working.

`schema`

mcptest [GLOBAL_OPTIONS] schema [--version v1]

Description. Emit the JSON Schema for the YAML config to stdout. Output is byte-equivalent to https://mcptest.sh/schema/v1.json. Use it to wire mcptest into IDEs (VS Code's YAML extension, IntelliJ) so authors get autocomplete and inline validation while they type.

Arguments.

Argument	Type	Required	Description
`--version <VERSION>`	string	optional, default `v1`	Schema version to emit. Only `v1` is shipped today; a future `v2` will land as a separate match arm.

Examples.

# Print the schema to stdout.
mcptest schema

# Pipe it into a tool that consumes JSON Schema.
mcptest schema > .vscode/mcptest.schema.json

# Validate ad-hoc YAML against the schema using a third-party tool.
mcptest schema > mcptest.schema.json
check-jsonschema --schemafile mcptest.schema.json my-tests.yaml

Exit codes. 0 on success. 2 when --version names an unknown schema revision.

Status. Working.

`coverage`

mcptest [GLOBAL_OPTIONS] coverage [--threshold PERCENT] [--format FORMAT]

Description. Compute per-tool and per-resource coverage metrics for the server's surface. Reports which tools and resources were exercised by the test suite and which were skipped, so authors can spot dead corners of their server.

The pure computation in mcptest_core::coverage is fully unit-tested. Running it requires a runner that records which tools and resources were exercised at execution time.

Arguments.

Argument	Type	Required	Description
`--threshold <PERCENT>`	float, 0 to 100	optional	Quality gate threshold as a percentage. When set and the computed coverage is below the threshold, the runner exits with code `6`.
`--format <FORMAT>`	enum `{pretty, json}`	optional, default `pretty`	Output format for the coverage report. `pretty` renders a human-friendly table; `json` emits the structured `CoverageReport`.

Examples.

# Local check after a test run, pretty output.
mcptest coverage

# Hard gate at 80% coverage in CI.
mcptest coverage --threshold 80

# Machine-readable JSON for downstream tooling.
mcptest coverage --format json > coverage.json

# Combine with --filter to scope coverage to a subset.
mcptest --filter '@public' coverage --threshold 90

Exit codes. 0 when coverage is computed (and meets the threshold when one is set). 6 when --threshold is set and the report does not meet it.
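The threshold gate described here reduces to one comparison. The sketch below shows the documented semantics only; the function name `coverage_gate` is hypothetical, and treating coverage exactly equal to the threshold as a pass is an assumption drawn from the "below the threshold" wording.

```shell
# Usage: coverage_gate <coverage_percent> <threshold_percent>
# Returns 6 when coverage is below the threshold, 0 otherwise.
coverage_gate() {
  # awk handles fractional percentages, which plain sh arithmetic cannot.
  if awk -v c="$1" -v t="$2" 'BEGIN { exit !(c + 0 < t + 0) }'; then
    return 6
  fi
  return 0
}
```

In a CI script the return value can be passed straight through with `coverage_gate "$pct" 80 || exit $?`.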

Status. Stub in v1.0. The handler prints the requested threshold and format, then exits 0. Live wiring lands when the runner records exercised tools and resources.

`report`

mcptest [GLOBAL_OPTIONS] report <INPUT> [--format FORMAT] [--output PATH]

Description. Re-render a previously-saved JSON run as another reporter format. Saves a re-run when CI already captured the canonical JSON but a different consumer (a PR comment, a GitHub job summary, a SARIF importer) wants a different shape.

The input is the JSON written by mcptest run --output run.json --reporter json or any equivalent invocation. The redaction policy is re-applied at the dispatch site so every output shape shares one redacted view.

Arguments.

Argument	Type	Required	Description
`<INPUT>`	filesystem path	yes	Path to a JSON report previously written by `mcptest run`.
`--format <FORMAT>`	enum, see below	optional, default `pretty`	Reporter format to render in.
`--output <PATH>`	filesystem path	optional	Write to this file instead of stdout.

Accepted --format values:

Format	Description
`pretty`	Human-friendly text output (default).
`json`	Pretty-printed JSON, round-trippable through the same model.
`junit`	JUnit XML suitable for `dorny/test-reporter` and CircleCI Insights.
`md`	GitHub-flavored Markdown for PR comments and job summaries.
`html`	Single-file HTML report with inline CSS.
`sarif`	SARIF 2.1.0 for GitHub code-scanning and similar consumers. See sarif-reporter.md.
`gitlab`	GitLab Code Quality JSON for merge request widgets. See gitlab-code-quality.md.
`ndjson`	Newline-delimited JSON: one `test` record per line, then a `summary`. For log pipelines and `jq -c`.
`tap`	Test Anything Protocol v14, for `prove`/`tappy`-style consumers.
`matrix`	Self-contained HTML test-by-model comparison grid. See matrix-reporter.md.
`matrix-md`	The comparison grid as GitHub-flavored Markdown.
`upload`	HTTPS upload of the canonical run envelope to `--upload-endpoint` (preview).

Examples.

# Re-render a saved run as JUnit XML for CI.
mcptest report run.json --format junit --output junit.xml

# Produce a Markdown summary for a PR comment.
mcptest report run.json --format md --output pr-summary.md

# Generate an HTML report for sharing in Slack.
mcptest report run.json --format html --output run.html

# Ship a run envelope to a collector.
mcptest --upload-endpoint https://collector.example.com/v1/runs \
        --upload-token-env COLLECTOR_TOKEN \
        --upload-organization acme \
        report run.json --format upload

# Print the round-trippable JSON to stdout.
mcptest report run.json --format json

Exit codes. 0 on a successful render or upload. 1 when the input file is missing or malformed, or when the upload CLI is misconfigured (no endpoint, bad URL). 2 when the collector returns an error or declines the upload.

Status. Working. Every format above is wired. The upload format is documented as a preview because the wire envelope schema is not yet finalized.

`eval`

mcptest [GLOBAL_OPTIONS] eval [--max-cost USD] [--no-verdict-cache] [--explain]

Description. Run quality evaluations against an MCP server using an LLM judge. v1.0 ships single-judge mode: every entry in the evals: block is graded by one juror, the verdict and rationale are pretty-printed, and a cost budget tracks cumulative spend. Multi-juror consensus, bias mitigations, and inter-juror agreement are not part of this release and are planned for a later one.

Arguments.

Argument	Type	Required	Description
`--max-cost <USD>`	float	optional	Hard ceiling in USD across every LLM-judge call. Accepts an optional leading `$`. The runner stops dispatching new tests once cumulative spend would exceed the cap.
`--no-verdict-cache`	flag	optional	Disable the LLM-judge verdict cache for this run. Overrides `evals.cache.verdicts: true` in YAML.
`--explain`	flag	optional	Print what each eval would grade (rubric, candidate source, judge model, judge-call count) without calling any provider or spending tokens, then exit.
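Because `--max-cost` accepts an optional leading `$`, a wrapper script that forwards a budget may want to normalize it first. The sketch below is one way to do that in plain sh; `normalize_usd` is a hypothetical helper, and the validation rules (digits and at most one decimal point) are assumptions rather than the parser mcptest uses.

```shell
# Usage: normalize_usd '<amount>'
# Prints the amount without a leading "$"; returns 1 on malformed input.
normalize_usd() {
  amount=${1#\$}   # strip one leading dollar sign, if present
  case $amount in
    ''|.|*[!0-9.]*|*.*.*)
      echo "invalid amount: $1" >&2
      return 1
      ;;
  esac
  printf '%s\n' "$amount"
}
```

When passing a value with a dollar sign on the command line, keep it in single quotes: an unquoted `$1.00` is expanded by the shell before mcptest ever sees it.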

Examples.

# Run every eval in mcptest.yaml.
mcptest eval

# Cap total LLM-judge spend at one dollar. Quote the value so the shell
# does not expand $1 as a positional parameter.
mcptest eval --max-cost '$1.00'

# Force fresh verdicts even when the YAML opts into caching.
mcptest eval --no-verdict-cache

Exit codes. 0 on success. 1 on a failed evaluation. 5 when the configured cost cap is exceeded.

Status. Working in single-judge mode. Multi-juror consensus follows in a later release.

`diff`

mcptest [GLOBAL_OPTIONS] diff <OLD> <NEW> [--format FORMAT] [--fail-on-breaking BOOL] [--scorecard]

Description. Diff two saved tools/list JSON snapshots and report which tools were added, removed, or reshaped, flagging each change as breaking or non-breaking so a CI job can fail loudly on a real regression. The --scorecard flag appends a release grade summarizing the diff.

Arguments.

Argument	Type	Required	Description
`<OLD>`	filesystem path	yes	Path to the old snapshot (the baseline).
`<NEW>`	filesystem path	yes	Path to the new snapshot (the candidate).
`--format <FORMAT>`	enum `{pretty, json, markdown}`	optional, default `pretty`	Output format for the diff report.
`--fail-on-breaking <BOOL>`	boolean	optional, default `true`	Exit code `1` when at least one change is breaking. Set to `false` for advisory CI output rather than a hard gate.
`--scorecard`	flag	optional, default off	Append a release scorecard (A+ / A / B / C / D / F letter grade plus per-tool added / removed / regressed callouts) to the diff output.

Scorecard grading. Aggregates the diff into one letter grade:

Grade	Trigger
`A+`	No changes at all. Identical snapshots.
`A`	At least one safe change, no breaking changes.
`B`	Exactly one breaking change.
`C`	Two or three breaking changes.
`D`	Four or more breaking changes.
`F`	Any tool removed between old and new (the most disruptive change a server can ship).

The grade matches the spirit of the compliance grade table so the two scorecards line up in marketing material.
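The grading table can be read as a first-match-wins cascade. The sketch below encodes it that way; `scorecard_grade` is a hypothetical name, and checking tool removal before the breaking-change count is an assumption based on the table describing `F` as triggered by any removal.

```shell
# Usage: scorecard_grade <safe_changes> <breaking_changes> <removed_tools>
scorecard_grade() {
  safe=$1 breaking=$2 removed=$3
  if [ "$removed" -gt 0 ]; then
    echo F            # any removed tool
  elif [ "$breaking" -ge 4 ]; then
    echo D            # four or more breaking changes
  elif [ "$breaking" -ge 2 ]; then
    echo C            # two or three breaking changes
  elif [ "$breaking" -eq 1 ]; then
    echo B            # exactly one breaking change
  elif [ "$safe" -gt 0 ]; then
    echo A            # safe changes only
  else
    echo A+           # identical snapshots
  fi
}
```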

Examples.

# Local diff against the previous saved snapshot, pretty output.
mcptest diff snapshots/old.json snapshots/new.json

# Hard fail in CI when any change is breaking.
mcptest diff snapshots/main.json snapshots/pr.json --fail-on-breaking true

# Advisory output for a PR comment (does not fail the job).
mcptest diff snapshots/main.json snapshots/pr.json \
        --format markdown \
        --fail-on-breaking false > pr-comment.md

# Machine-readable JSON for downstream tooling.
mcptest diff snapshots/main.json snapshots/pr.json --format json

Exit codes. 0 when there are no breaking changes (or --fail-on-breaking false is set). 1 when --fail-on-breaking is true and at least one change is breaking, or when a snapshot file is missing or malformed.

Status. Working. The diff engine lives in mcptest_core::diff; the CLI handler loads the JSON snapshots, calls diff_tool_catalogs, and renders the report.

`lint`

mcptest [GLOBAL_OPTIONS] lint [PATH...] [--format pretty|json] [--no-fail]

Scan YAML suites and JSON cassettes for usage of features the MCP 2026-07-28 spec deprecates. One-shot, offline: no server is contacted. Walks every *.yaml / *.yml / *.json file under each PATH (defaults to the current directory) and emits one finding per hit. target/, node_modules/, and .git/ are skipped.

Detected patterns:

Kind	Trigger	Replacement
`roots`	`roots/list` method	Tool parameters, resource URIs, or server configuration.
`sampling`	`sampling/createMessage` method	Direct integration with LLM provider APIs.
`logging`	`logging/setLevel` or `notifications/message` method	`stderr` for stdio servers; OpenTelemetry for structured observability.
`tasks-list`	`tasks/list` method (removed in the Tasks extension lifecycle)	`tasks/get` polling on a server-issued task handle.
`legacy-error-code`	Literal `-32002` error code	Standard JSON-RPC `-32602` Invalid Params code.

Exit code. 0 when no findings land or with --no-fail. 1 when any finding is reported. Useful as a CI gate.

Example.

mcptest lint examples/ docs/
mcptest lint --format json suites/ > deprecations.json
mcptest lint --no-fail .  # advisory mode
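For intuition about what the offline scan looks for, the detected-patterns table can be approximated with `find` and `grep`. This is a rough stand-in, not the linter: it matches raw text rather than parsed YAML or JSON, the function name `scan_deprecated` is hypothetical, and the output format is grep's, not mcptest's.

```shell
# Usage: scan_deprecated [DIR]
# Prints file:line:text for each hit; returns 1 when anything was found.
scan_deprecated() {
  pattern='roots/list|sampling/createMessage|logging/setLevel|notifications/message|tasks/list|-32002'
  # Skip target/, node_modules/, and .git/; search only YAML and JSON files.
  # /dev/null forces grep to print file names even for a single match file.
  hits=$(find "${1:-.}" \
    \( -name target -o -name node_modules -o -name .git \) -prune -o \
    -type f \( -name '*.yaml' -o -name '*.yml' -o -name '*.json' \) \
    -exec grep -nE -e "$pattern" /dev/null {} + 2>/dev/null) || true
  if [ -z "$hits" ]; then
    return 0
  fi
  printf '%s\n' "$hits"
  return 1
}
```

The return value mirrors the documented exit codes: 0 for a clean tree, 1 when at least one finding is printed.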

Status. Working. The migration probe in mcptest doctor --target-version adds a live check that complements this offline scan.

`migrate`

mcptest [GLOBAL_OPTIONS] migrate [PATH...] [--to 2026-07-28] [--write]

Rewrite YAML suites toward an MCP spec revision. v1 covers the 2026-07-28 target only; any other --to value is a clear error. Two kinds of rewrite ship:

Annotation. Every deprecated-feature hit gets a # TODO(mcptest-migrate) comment inserted immediately above the offending line, pointing at the replacement guidance from the migration corpus. The original line is left intact so the file still parses and so the operator can apply the human-judgement rewrite.
Mechanical rewrite. The legacy -32002 JSON-RPC error code has a safe one-to-one replacement (-32602 Invalid Params). The migrator annotates the line with a TODO and substitutes the literal token.

The same deprecation catalog mcptest lint uses drives the migrator, so a run that lints clean migrates as a no-op.

Flags.

Flag	Default	Description
`--to <VERSION>`	`2026-07-28`	Target spec revision. v1 supports `2026-07-28` only.
`--write`	off	Apply the rewrites in place. Default is dry-run (print the per-file action plan).

Exit code. 0 on a successful run (writes applied or dry-run completed). 2 if --to names an unsupported target version.

Example.

mcptest migrate examples/             # dry-run, prints what would change
mcptest migrate --write suites/       # apply the rewrites in place

Status. Working: YAML annotation plus the legacy error-code rewrite. Cassette rewrites land in a later release alongside the streamable-HTTP transport rerouting.

`discover`

mcptest [GLOBAL_OPTIONS] discover <SERVER> [--output PATH] [--bearer-token-env NAME]

Description. Connect to an MCP server, run the initialize handshake, and call tools/list, resources/list, and prompts/list. Pretty-prints the discovered capabilities to stderr and writes a starter tests.yaml with one smoke test per tool.

Arguments.

Argument	Type	Required	Description
`<SERVER>`	URL or `name=url` pair	yes	MCP server endpoint. Bare URLs are labelled `discovered`; `name=https://...` lets you pick a friendlier server name in the scaffolded YAML.
`--output <PATH>`	filesystem path	optional, default `tests.yaml`	Path for the scaffolded suite.
`--bearer-token-env <NAME>`	string	optional	Read a bearer token from this env var and send it as `Authorization: Bearer <value>` during the probe.
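The two accepted shapes of `<SERVER>` can be told apart with a `case` statement. The sketch below shows one plausible reading of the rule; `parse_server_arg` is a hypothetical helper, and splitting on the first `=` (so that query strings containing `=` stay inside the URL) is an assumption.

```shell
# Usage: parse_server_arg <SERVER>
# Prints "<name> <url>"; bare URLs get the name "discovered".
parse_server_arg() {
  case $1 in
    http://*|https://*)
      printf 'discovered %s\n' "$1"
      ;;
    *=*)
      # Name is everything before the first "=", URL is everything after it.
      printf '%s %s\n' "${1%%=*}" "${1#*=}"
      ;;
    *)
      echo "expected a URL or name=url pair, got: $1" >&2
      return 1
      ;;
  esac
}
```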

Examples.

# Scaffold against a local server.
mcptest discover http://localhost:8080/mcp

# Scaffold against an authenticated server, into a custom path.
MCP_TOKEN=abc mcptest discover https://api.example.com/mcp \
        --bearer-token-env MCP_TOKEN \
        --output suites/example.yaml

Exit codes. 0 on success, 1 when the handshake fails or the server is unreachable.

Status. Working.

`completions`

mcptest [GLOBAL_OPTIONS] completions <SHELL>

Description. Emit a shell completion script for the chosen shell. Lives in this subcommand rather than as a flag so users can pipe straight into a shell init file.

Arguments.

Argument	Type	Required	Description
`<SHELL>`	enum `{bash, zsh, fish, powershell, elvish}`	yes	Shell to emit completions for. Backed by `clap_complete::Shell`.

Examples.

# Bash, system-wide.
mcptest completions bash | sudo tee /etc/bash_completion.d/mcptest > /dev/null

# Zsh, per-user.
mcptest completions zsh > ~/.zsh/completions/_mcptest

# Fish, per-user.
mcptest completions fish > ~/.config/fish/completions/mcptest.fish

# PowerShell.
mcptest completions powershell | Out-String | Invoke-Expression

Exit codes. 0 on success. clap rejects unknown shells with its standard error path (exit code 2).

Status. Working. The handler is a single call into clap_complete.

`model-compat`

mcptest [GLOBAL_OPTIONS] model-compat <SUBCOMMAND>

Description. Capture, diff, and replay model-compatibility baselines. The v1.0 headline workflow: snapshot the suite against a model, then compare a later run against that snapshot to classify every assertion as PASS, DRIFT, or FAIL.

Subcommands.

Subcommand	Purpose
`capture`	Write a baseline JSON file for the given model.
`diff`	Compare two saved baselines and render the result.
`run`	Re-run the suite against a candidate and diff against the baseline.

mcptest model-compat capture.

mcptest model-compat capture --model ID --output PATH --input PATH [--filter GLOB]

Argument	Type	Required	Description
`--model <ID>`	string	yes	Provider-qualified model identifier (for example `anthropic:claude-sonnet-4.5`).
`--output <PATH>`	filesystem path	yes	Destination for the baseline JSON.
`--input <PATH>`	filesystem path	yes	Source baseline file from the runner. The live runner integration is a follow-up ticket; for v1.0 this flag lets `capture` ride on a pre-built baseline so the CLI surface is exercised end to end.
`--filter <GLOB>`	string	optional	Narrow the captured assertion list with a `*`-style glob.

mcptest model-compat diff.

mcptest model-compat diff <BASELINE> <CANDIDATE> [--format FORMAT] [--filter GLOB]

Argument	Type	Required	Description
`<BASELINE>`	filesystem path	yes	The baseline (left-hand side).
`<CANDIDATE>`	filesystem path	yes	The candidate (right-hand side).
`--format <FORMAT>`	enum `{pretty, json}`	optional, default `pretty`	Output format. JSON renders the full `BaselineDiff`.
`--filter <GLOB>`	string	optional	Narrow which assertion ids appear in the diff.

mcptest model-compat run.

mcptest model-compat run --baseline PATH --model ID --candidate PATH

Argument	Type	Required	Description
`--baseline <PATH>`	filesystem path	yes	Saved baseline to diff against.
`--model <ID>`	string	yes	Candidate model identifier; printed in the report header.
`--candidate <PATH>`	filesystem path	yes	Pre-captured candidate baseline. The live runner lands in a follow-up ticket; for v1.0 this flag lets `run` exercise the PASS/DRIFT/FAIL exit handling.

Examples.

# Capture a baseline against the current production model.
mcptest model-compat capture \
        --model anthropic:claude-sonnet-4.5 \
        --output baselines/sonnet-4.5.json \
        --input artifacts/last-run.json

# Compare two saved baselines as JSON for a CI gate.
mcptest model-compat diff baselines/sonnet-4.5.json baselines/sonnet-5.0.json --format json

# Run the suite against a candidate and exit per the rubric.
mcptest model-compat run \
        --baseline baselines/sonnet-4.5.json \
        --model anthropic:claude-sonnet-5.0 \
        --candidate artifacts/sonnet-5.0.json

Exit codes. 0 PASS (every assertion classified PASS). 6 DRIFT (at least one DRIFT, no FAIL). 1 FAIL (any invariant violated or assertion missing).
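The exit-code rubric depends only on whether any assertion drifted or failed. A minimal sketch, assuming the counts are already available from a parsed report (the function name `model_compat_exit_code` is hypothetical):

```shell
# Usage: model_compat_exit_code <drift_count> <fail_count>
# Prints the exit code the rubric assigns to a run with these counts.
model_compat_exit_code() {
  if [ "$2" -gt 0 ]; then
    echo 1    # FAIL wins over DRIFT
  elif [ "$1" -gt 0 ]; then
    echo 6    # DRIFT with no FAIL
  else
    echo 0    # every assertion PASS
  fi
}
```

A CI job that wants DRIFT to warn rather than block can treat exit code 6 as a soft failure while still failing hard on 1.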

Status. Working. Library entry points live in mcptest_core::model_compat; the CLI dispatches to commands::model_compat.

`compliance`

mcptest [GLOBAL_OPTIONS] compliance <SUBCOMMAND>

Description. Score an MCP server against the compliance rubric and render the result in one of four formats. The score is the same ComplianceScore from mcptest_core::compliance::scoring; the four renderers in mcptest_core::compliance::renderers (pretty, JSON, Markdown, HTML) own the presentation. When a baseline is supplied the four BaselineDecision outcomes drive the exit code so CI can stay green while a known set of MUSTs is still pending.

Subcommands.

Subcommand	Purpose
`run`	Run the compliance corpus and render the score.
`invariants`	Evaluate spec-derived conformance invariants over a captured session, plus multi-server composition-safety checks.

mcptest compliance run.

mcptest compliance run \
        --results-from PATH \
        [--format FORMAT] \
        [--server-label LABEL] \
        [--registry PATH] \
        [--capabilities LIST] \
        [--baseline PATH | --expected-failures PATH] \
        [--update-baseline] [--yes]

Argument	Type	Required	Description
`--results-from <PATH>`	filesystem path	yes	JSON list of `CheckResult` records produced by the runner. The live runner integration lands in a follow-up ticket; for v1.0 this flag lets the score and reporter surface ride on a pre-built artifact.
`--format <FORMAT>`	enum `{pretty, json, markdown, html}`	optional, default `pretty`	Output format.
`--server-label <LABEL>`	string	optional	Label printed in the report header. Falls back to the global `--server-url` when omitted.
`--registry <PATH>`	filesystem path	optional, default `compliance/registry.yml`	Path to the rule registry YAML.
`--capabilities <LIST>`	comma list	optional	Capabilities the server declared during `initialize` (for example `tools,resources`). Drives section applicability.
`--baseline <PATH>`	filesystem path	optional	Baseline file listing expected failures. Layers the four `BaselineDecision` outcomes on top of the run.
`--expected-failures <PATH>`	filesystem path	optional	Alias for `--baseline`.
`--update-baseline`	flag	optional	Rewrite the baseline file from the current run after a confirmation prompt. Requires `--baseline` (or `--expected-failures`).
`--yes`	flag	optional	Skip the confirmation prompt for `--update-baseline`.

Exit codes. Without a baseline: 0 when no MUSTs failed, 1 when at least one did. With a baseline:

Decision	Meaning	Exit
`NormalPass`	Check passed and is not on the baseline.	`0`
`ExpectedFailure`	Check failed but is on the baseline.	`0`
`NewRegression`	Check failed and is NOT on the baseline.	`1`
`StaleBaseline`	Check passed but is still on the baseline.	`1`
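The decision table is a two-by-two lookup on whether a check passed and whether it is listed in the baseline. The sketch below spells that out; `baseline_decision` and its `yes`/`no` arguments are hypothetical and stand in for whatever the scoring code uses internally.

```shell
# Usage: baseline_decision <passed: yes|no> <on_baseline: yes|no>
# Prints "<Decision> <exit code>".
baseline_decision() {
  case "$1:$2" in
    yes:no)  echo "NormalPass 0" ;;
    no:yes)  echo "ExpectedFailure 0" ;;
    no:no)   echo "NewRegression 1" ;;
    yes:yes) echo "StaleBaseline 1" ;;
    *)
      echo "usage: baseline_decision yes|no yes|no" >&2
      return 2
      ;;
  esac
}
```

The `StaleBaseline` row is the one that surprises people: a check that starts passing still fails the run until its entry is removed from the baseline file.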

Examples.

# Render the score as Markdown for a PR comment.
mcptest compliance run \
        --results-from artifacts/compliance.json \
        --format markdown \
        --capabilities tools,resources

# Gate CI on a baseline so known failures stay green.
mcptest compliance run \
        --results-from artifacts/compliance.json \
        --baseline compliance-baseline.yml

# Regenerate the baseline after a deliberate cleanup.
mcptest compliance run \
        --results-from artifacts/compliance.json \
        --baseline compliance-baseline.yml \
        --update-baseline --yes

mcptest compliance invariants.

mcptest compliance invariants --capture PATH [--format FORMAT]

Argument	Type	Required	Description
`--capture <PATH>`	filesystem path	yes	JSON capture file. A single session object runs the per-server invariants; an array of sessions also runs the multi-server composition-safety checks.
`--format <FORMAT>`	enum `{pretty, json}`	optional, default `pretty`	Output format.

The invariants are the INV-NNN family (handshake ordering, capability attestation, tool result-shape, JSON-RPC error envelopes). With two or more sessions the command also checks tool-namespace overlap and shared-transport id collisions. The capture is read from disk so the run is deterministic and contacts no server. Exit 0 when every invariant passes with no composition hazard, 1 otherwise. See docs/conformance-invariants.md.

Status. Working. Library entry points live in mcptest_core::compliance; the CLI dispatches to commands::compliance.

`pipe`

mcptest pipe <PIPELINE> [--url URL] [--bearer-token-env NAME]
                        [--var KEY=VALUE ...] [--format pretty|json]
                        [--dry-run [--estimate-cost]]
                        [--max-cost USD] [--max-tokens N]
                        [--max-cost-per-call USD] [--max-duration DURATION]
                        [--on-budget-exceeded stop|continue|warn]
                        [--pricing-table PATH]

Description. Run a declarative multi-step tool-call pipeline. Each step calls a tool, extracts values from earlier steps by reference, and binds them into later steps. The full pipeline YAML grammar, the reference expression language (${step.field}, ${var.X}, ${env.X}), the when: guard, on_error, and the cumulative budget controls are documented in reference/pipe.md.

Arguments.

Argument	Type	Required	Description
`<PIPELINE>`	path	yes	The pipeline YAML file.
`--url <URL>`	string	optional	MCP server the pipeline's tool calls target.
`--bearer-token-env <NAME>`	string	optional	Env var holding a bearer token for every request.
`--var <KEY=VALUE>`	repeatable	optional	Inject a variable referenced as `${var.KEY}`.
`--format <pretty|json>`	enum	optional	`pretty` prints the last step's result; `json` prints the full execution trace.
`--dry-run`	flag	optional	Print the planned execution without making any tool calls.
`--estimate-cost`	flag	optional	With `--dry-run`, also print a projected cost estimate.
`--max-cost <USD>`	float	optional	Aggregate USD ceiling across all steps.
`--max-tokens <N>`	integer	optional	Aggregate token ceiling (input + output) across all steps.
`--max-cost-per-call <USD>`	float	optional	Per-step USD ceiling, layered on the aggregate cap.
`--max-duration <DURATION>`	duration	optional	Wall-clock ceiling (`30s`, `2m`, `1h`).
`--on-budget-exceeded <MODE>`	enum	optional	`stop` (default) fails fast, `continue` runs to completion then exits non-zero, `warn` runs and exits `0`.
`--pricing-table <PATH>`	path	optional	Override the bundled `pricing.yaml` used for cost estimates.
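For scripts that compute a `--max-duration` value or compare it against a CI timeout, the three example forms (`30s`, `2m`, `1h`) convert to seconds as sketched below. `duration_to_seconds` is a hypothetical helper; accepting only a single integer followed by one unit is an assumption based on those examples, and the real parser may accept more.

```shell
# Usage: duration_to_seconds <DURATION>
# Accepts <integer><s|m|h>; prints the equivalent number of seconds.
duration_to_seconds() {
  n=${1%[smh]}   # numeric part, with one trailing unit letter removed
  case $n in
    ''|*[!0-9]*|0?*)
      echo "invalid duration: $1" >&2
      return 1
      ;;
  esac
  case $1 in
    *s) echo "$n" ;;
    *m) echo $((n * 60)) ;;
    *h) echo $((n * 3600)) ;;
    *)
      echo "missing unit (s, m, or h): $1" >&2
      return 1
      ;;
  esac
}
```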

Example.

mcptest pipe examples/pipe-search-then-update.yml \
  --url http://localhost:8000/mcp --var USER_QUERY=alice --max-cost 0.50

Status. Working (pipelines and budgets).

`tools`, `resources`, `prompts`, `capabilities`

mcptest tools        --url URL [--bearer-token-env NAME] [--format pretty|json]
mcptest tools call <TOOL> --url URL [--args JSON | --args-from-stdin | --arg NAME=$.path ...]
                          [--bind NAME=$.path ...] [--then TOOL [--then-arg NAME=VALUE|NAME=:bound]...]
                          [--select $.path] [--json] [--max-cost USD]
mcptest resources    --url URL [--format pretty|json]
mcptest prompts      --url URL [--format pretty|json]
mcptest capabilities --url URL [--format pretty|json]

Description. Introspect a live server's catalog. The bare tools / resources / prompts forms list the catalog; capabilities prints the initialize capability block. tools call <TOOL> runs one tool imperatively with the chaining primitives so a shell pipeline can extract and forward values without reaching for jq.

tools call arguments.

Argument	Type	Description
`<TOOL>`	string	Tool to call.
`--url <URL>`	string	MCP server endpoint.
`--args <JSON>`	string	Literal args object as a JSON string.
`--args-from-stdin`	flag	Read the entire args object from stdin JSON. Conflicts with `--args`.
`--arg <NAME=$.path>`	repeatable	Extract a value from stdin JSON by JSONPath and use it as the named arg.
`--bind <NAME=$.path>`	repeatable	Capture a value from this call's output for a chained `--then` step.
`--then <TOOL>`	string	Chain a second tool call within the same invocation.
`--then-arg <NAME=VALUE>`	repeatable	Argument for the `--then` step. `name=:bound` references a `--bind` capture.
`--select <$.path>`	string	Project the output by JSONPath before printing.
`--json`	flag	Emit JSON instead of pretty output.
`--max-cost <USD>`	float	Aggregate USD ceiling across both calls in this invocation.

Example.

# Quote JSONPath expressions so the shell does not glob [0] or touch the $.
mcptest tools call search --url "$URL" --args '{"query":"alice"}' \
  --bind 'user_id=$.results[0].id' \
  --then fetch_user --then-arg user_id=:user_id --select '$.email'

Status. Working (introspection and chaining).

`inspect`

mcptest inspect --url URL [--bearer-token-env NAME]
mcptest inspect -- <command> [args...]

Connect to one MCP server and explore it in an interactive REPL: the terminal sibling of the one-shot tools / resources / prompts / capabilities / discover commands. Target the server either over streamable HTTP (--url) or over stdio (everything after -- is the server command). The REPL reads commands from stdin, so a piped script drives it the same way an interactive session does.

REPL commands (type help to list them in-session):

Command	Action
`tools` / `ls`	List tools.
`call <tool> [json]`	Call a tool with a JSON-object argument (default `{}`).
`resources`	List resources.
`read <uri>`	Read a resource.
`prompts`	List prompts.
`prompt <name> [json]`	Get a prompt.
`capabilities` / `caps`	Show the server capability block.
`notifications` / `notif`	Show notifications received this session.
`discover [path]`	Scaffold a `tests.yaml` from the live catalog (default `tests.yaml`).
`quit` / `exit` / `q`	Leave the session.

Live server activity is surfaced as it arrives. Notifications (logging, progress, list_changed) print with a <- notification prefix. Server-initiated requests are fulfilled automatically so a server that drives the client can be exercised end to end: sampling/createMessage returns a stub assistant message (no real model call), elicitation/* is declined, and roots/list returns an empty list. inspect advertises the matching client capabilities during the handshake so the server knows it may use them.

Example.

# Stdio server
mcptest inspect -- npx -y @modelcontextprotocol/server-everything

# HTTP server, scripted (non-interactive)
printf 'tools\ncall search {"query":"alice"}\nquit\n' \
  | mcptest inspect --url "$URL" --bearer-token-env MCP_TOKEN

Status. Working (terminal-only; web viewers are out of scope).
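
Because the REPL reads its commands from stdin, the script can be generated rather than typed. The sketch below is one way to do that; build_inspect_script is a hypothetical helper name (not part of mcptest), and it only uses the documented tools, call, and quit REPL commands.

```shell
# Hypothetical helper: print a REPL script that lists tools, calls each
# named tool with an empty JSON object, and then leaves the session.
build_inspect_script() {
  printf 'tools\n'
  for tool in "$@"; do
    printf 'call %s {}\n' "$tool"
  done
  printf 'quit\n'
}

# Usage (assumes $URL points at a reachable MCP server):
#   build_inspect_script search fetch_user | mcptest inspect --url "$URL"
```

Keeping the generator separate from the mcptest invocation makes it easy to review the exact commands before they reach a live server.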

`mcp-server`

mcptest mcp-server [--workspace PATH] [--enable-writes] [--mcptest-bin PATH]

Description. Run mcptest itself as a stdio MCP server so an MCP-aware agent (Claude Code, Cursor, mcp-inspector) can query runs, cassettes, coverage, and diagnostics from inside the editor. Read tools are always available; --enable-writes adds the run-triggering and cassette-recording tools. The full tool and resource catalog is in mcp-server.md.

Arguments.

Argument	Type	Description
`--workspace <PATH>`	path	Workspace root. Defaults to the current directory.
`--enable-writes`	flag	Enable the write tools (`trigger_run`, `record_cassette`). Off by default.
`--mcptest-bin <PATH>`	path	Override the `mcptest` binary the write tools spawn.

Status. Working. To register it, point the agent config at command: "mcptest", args: ["mcp-server", "--workspace", "."].

`generate`

mcptest generate stubs --url URL [--bearer-token-env NAME]
                       [--output DIR] [--overwrite] [--stdout] [--check]

mcptest generate suite --from-config FILE [--server-name NAME]
                       [--models ID,ID,...] [--no-edge] [--no-violation]
                       [--output PATH | --update PATH]

Description. Scaffold runnable YAML tests from a server's tool catalog. The generators are grouped under one generate subcommand so that new generators can be added as siblings of stubs and suite.

generate stubs introspects a live server and emits one test stub file per advertised tool.

generate suite synthesizes one self-contained suite document from a server's declared tools: a servers: block, a multi-model matrix placeholder, and three cases per tool (valid arguments, a boundary edge case, and a schema-violation case that expects an error). The emitted YAML validates against schemas/v1.json, so it runs as written.

generate stubs arguments.

Argument	Type	Description
`--url <URL>`	string	MCP server endpoint to introspect.
`--bearer-token-env <NAME>`	string	Env var holding a bearer token forwarded to every request.
`--output <DIR>`	path	Directory the generated YAML is written under. Default `tests/tools`.
`--overwrite`	flag	Replace existing stub files instead of skipping them.
`--stdout`	flag	Print every stub concatenated to stdout instead of writing to disk.
`--check`	flag	Exit `6` if any generated stub differs from the checked-in file. CI drift detection.

generate suite arguments.

Argument	Type	Description
`--from-config <FILE>`	path	Read the declared tool list from a file: a `tools/list` JSON snapshot (`{"tools": [...]}`), a bare JSON tools array, or a YAML mock manifest (`mock_server.tools[]`).
`--server-name <NAME>`	string	Server key the generated tests reference, also the `servers:` entry. Default `server`.
`--models <ID,ID,...>`	list	Model identifiers for the `model_compatibility:` matrix. Omit to use the built-in default lineup.
`--no-edge`	flag	Skip the boundary edge case per tool.
`--no-violation`	flag	Skip the schema-violation case per tool.
`--output <PATH>`	path	Write the suite to a file instead of stdout.
`--update <PATH>`	path	Merge into an existing suite, keeping every hand-authored test and appending only tests whose `name` is new. Mutually exclusive with `--output`.

Why a file, not a live connection. Reading tools from a file keeps generation deterministic and CI-reproducible. Live tools/list introspection for generate suite is the planned follow-up; it will reuse the same connector as generate stubs --url.

Status. Working (stubs and suite).
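
The --check flag is meant for CI drift detection, so a wrapper that interprets its exit code can be useful. The following is a minimal sketch under stated assumptions: check_stub_drift is a hypothetical helper name, and the messages are illustrative. Only the flags and the exit code 6 come from the table above.

```shell
# Hypothetical CI helper: report whether checked-in stubs have drifted.
# $1 is the MCP server URL; tests/tools is the documented default output dir.
check_stub_drift() {
  rc=0
  mcptest generate stubs --url "$1" --output tests/tools --check || rc=$?
  case "$rc" in
    0) echo "stubs match the live catalog" ;;
    6) echo "stub drift: regenerate with --overwrite and commit" ;;
    *) echo "generate stubs failed (exit $rc)" ;;
  esac
  return "$rc"
}

# Usage: check_stub_drift "$URL"
```

Returning the original exit code lets the calling CI step fail for the same reason mcptest did.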

`mock`

mcptest mock --tools-from PATH

Description. Spawn a YAML-driven stdio mock MCP server. The mock loads its tool catalog from --tools-from and serves it over stdio, so a client integration can be exercised without the real backend. v1.0 ships stdio only.

Arguments.

Argument	Type	Description
`--tools-from <PATH>`	path	A YAML manifest (`mock_server.tools[]`) or a `tools/list` baseline JSON snapshot.

Status. Working. The cassette-driven mock (--cassettes) is a separate draft; see mcptest-mock.md.

`exec`

mcptest exec --connection-server [--config PATH] [--ipc-version VERSION]
             [--no-cache] [--no-cassette] [--record-cassettes]
             [--debug-output PATH] [--verbose]

Description. Run mcptest as an IPC co-process for the native SDKs. The SDK (pytest, Vitest, Go, etc.) spawns this command, pipes newline-delimited JSON-RPC over stdin/stdout, and reads canonical responses back. You do not run this by hand; the language SDKs invoke it.

Arguments.

Argument	Type	Description
`--connection-server`	flag	Required mode toggle (the only mode in v1).
`--config <PATH>`	path	mcptest config. Defaults to `mcptest.yaml`.
`--ipc-version <VERSION>`	string	Pin the IPC envelope version. The dispatcher rejects a version newer than it supports.
`--no-cache`	flag	Disable the cache for this session.
`--no-cassette`	flag	Disable the cassette layer (no replay, no record).
`--record-cassettes`	flag	Record new cassettes for any unmocked call seen this session.
`--debug-output <PATH>`	path	Write a verbatim transcript for SDK debugging.
`--verbose`	flag	Emit envelope counts and lifecycle events on stderr.

Status. Working.

`login`

mcptest login [SERVER] [--url URL] [--client-id ID] [--all] [--force] [--no-browser]

Description. Interactive OAuth 2.1 + PKCE login that caches a token for later runs. Discovers the IdP from the target's .well-known/oauth-authorization-server, runs the browser flow against a loopback listener, and caches the token (and any Dynamic Client Registration metadata) for subsequent mcptest run invocations.

Arguments.

Argument	Type	Description
`[SERVER]`	string	Named server from `mcptest.yml` to log in to. Mutually exclusive with `--url` and `--all`. Omit for the single configured URL server or an interactive picker.
`--url <URL>`	string	Authenticate against a URL not recorded in `mcptest.yml`. Conflicts with `--all`.
`--client-id <ID>`	string	Fallback OAuth `client_id` for IdPs without a `registration_endpoint`. Ignored when DCR is available.
`--all`	flag	Log in to every configured URL server in declaration order. Stdio servers are skipped with a warning.
`--force`	flag	Clear the cached token and DCR metadata before running the flow.
`--no-browser`	flag	Print the authorization URL on stdout instead of opening a browser. The loopback listener still accepts the callback. CI and headless escape hatch.

Status. Working.

`prompt`

mcptest prompt [--output PATH]

Description. Print a copy-paste-ready grounding prompt for an LLM assistant writing mcptest YAML. Prints to stdout by default.

Arguments.

Argument	Type	Description
`--output <PATH>`	path	Write the prompt to a file instead of stdout.

Status. Working.

`cache`

mcptest cache [--cache-dir PATH] <list|stats|clear|prune>
mcptest cache clear [--older-than DURATION]

Description. Inspect or evict the local cache store. The --cache-dir override points the store at a non-default root and is accepted on every cache subcommand.

Subcommands.

Subcommand	Description
`list`	List every cached entry with size and age.
`stats`	Print totals plus the hit-rate row.
`clear [--older-than DURATION]`	Remove entries. Without `--older-than`, removes everything. Duration like `30m`, `2h`, `7d`.
`prune`	Remove entries older than 30 days. CI-friendly alias so scripts do not pick a number.

Status. Working.

`security`

mcptest [GLOBAL_OPTIONS] security <SNAPSHOT> [--format FORMAT] [--fail-on SEVERITY]

Description. Scan a tools/list-style JSON snapshot with the bundled deterministic security checks and report the findings. No model decides a verdict: every check is a regex or structural predicate over the tool, prompt, and resource definitions, so a finding is reproducible. The first bundled lane is the tool-surface family (SEC-001 through SEC-009); see the security test catalog.

Arguments.

Argument	Type	Required	Description
`<SNAPSHOT>`	filesystem path	yes	A JSON snapshot that may carry `tools`, `prompts`, and `resources` arrays.
`--format <FORMAT>`	enum `{pretty, json, sarif, html, md}`	optional, default `pretty`	Output format. SARIF 2.1.0 drops into code scanning; `html`/`md` emit a reviewer-grade vulnerability report with an OWASP LLM Top 10 coverage table (see security-report.md).
`--fail-on <SEVERITY>`	enum `{info, low, medium, high, critical}`	optional, default `high`	Exit code `1` when any finding is at or above this severity.

Examples.

# Human-readable summary.
mcptest security tools-list.json

# Hard fail in CI on any high or critical finding.
mcptest security tools-list.json --fail-on high

# SARIF for code scanning.
mcptest security tools-list.json --format sarif > security.sarif

Exit codes. 0 when nothing fires at or above --fail-on, 1 when something does, 2 when the snapshot cannot be read or parsed.

Subcommands.

security redteam drives the live red-team corpus against a running server (advisory only, never the verdict).
security import folds an external scanner's output into a unified report (see below).

Status. Working. The engine lives in mcptest_core::security; the active probes and the integrity, namespace, and advisory lanes are tracked under the security-framework epic.
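
The three documented exit codes make it straightforward to gate a CI job on the scan. The sketch below assumes a hypothetical wrapper name, security_gate, and illustrative messages; the flags and exit codes are the ones listed above.

```shell
# Hypothetical CI gate around the documented exit codes of `mcptest security`.
# $1 is the snapshot path, $2 the severity floor (defaults to high).
security_gate() {
  floor="${2:-high}"
  rc=0
  mcptest security "$1" --fail-on "$floor" || rc=$?
  case "$rc" in
    0) echo "security: nothing at or above $floor" ;;
    1) echo "security: findings at or above $floor" ;;
    2) echo "security: snapshot could not be read or parsed" ;;
    *) echo "security: unexpected exit $rc" ;;
  esac
  return "$rc"
}

# Usage: security_gate tools-list.json medium
```

Distinguishing 1 from 2 matters in practice: a finding is a policy failure, while an unreadable snapshot usually points at a broken pipeline step.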

`security import`

mcptest security import [--sarif FILE]... [--snyk FILE]... [--supplement FILE]... \
    [--snapshot FILE] [--advisory] [--format FORMAT] [--fail-on SEVERITY]

Description. mcptest owns the ingest, not the scan. import normalizes a scanner you already run into the finding vocabulary, dedups it against the bundled catalog (an overlapping SEC rule is counted once), and prints one unified report. SARIF 2.1.0 is read with --sarif, Snyk agent-scan JSON with --snyk, and any other JSON shape (a top-level array or a findings array) with --supplement. Each flag is repeatable. With --snapshot, the bundled deterministic lanes also run and the imports fold in beside them. See the external-scanner supplement.

Arguments.

Argument	Type	Required	Description
`--sarif <FILE>`	filesystem path, repeatable	at least one of the three	A SARIF 2.1.0 log. The scanner name is read from its tool driver.
`--snyk <FILE>`	filesystem path, repeatable	at least one of the three	A Snyk agent-scan `ScanPathResult` JSON file.
`--supplement <FILE>`	filesystem path, repeatable	at least one of the three	A generic scanner JSON file.
`--snapshot <FILE>`	filesystem path	optional	A `tools/list` snapshot to also scan with the bundled lanes.
`--advisory`	flag	optional	Mark every import advisory, so none of it gates.
`--format <FORMAT>`	enum `{pretty, json, sarif, html, md}`	optional, default `pretty`	Output format. `html`/`md` emit a vulnerability report with OWASP coverage (see security-report.md).
`--fail-on <SEVERITY>`	enum `{info, low, medium, high, critical}`	optional, default `high`	Exit `1` when any counted finding is at or above this floor.

Examples.

# Fold an AgentSeal SARIF file and a Snyk agent-scan JSON file into one report.
mcptest security import \
  --sarif examples/security/agentseal.sarif.json \
  --snyk examples/security/snyk-agent-scan.json

# Combine an import with a bundled snapshot scan, emitting SARIF.
mcptest security import --sarif scan.sarif \
  --snapshot tools-list.json --format sarif > security.sarif

Exit codes. 0 when nothing counted fires at or above --fail-on, 1 when something does, 2 when a file cannot be read or no scanner file is given.

`sbom`

mcptest [GLOBAL_OPTIONS] sbom [--format FORMAT] [--out PATH] [--verify]

Description. Print the CycloneDX 1.5 Software Bill of Materials that the build script baked into the binary at compile time, list licenses, or verify the embedded blob has not been swapped at runtime. The full guide lives at Software Bill of Materials.

Arguments.

Argument	Type	Required	Description
`--format <FORMAT>`	enum `{cyclonedx, licenses, names}`	optional, default `cyclonedx`	`cyclonedx` is the raw embedded JSON; `licenses` is one line per dep with its SPDX expression; `names` is one line per dep with just name and version.
`--out <PATH>`	filesystem path	optional	Write the output here instead of stdout.
`--verify`	flag	optional	Re-hash the embedded BOM at runtime, compare to the build-time SHA, exit 0 on match and 2 on mismatch.

Examples.

# Pipe the BOM into a scanner.
mcptest sbom > mcptest.cdx.json

# Quick license inventory.
mcptest sbom --format licenses

# Confirm the embedded blob has not been tampered with at runtime.
mcptest sbom --verify

Exit codes. 0 on success or successful verification, 2 when --verify detects a hash mismatch.

Status. Working.
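
One possible way to combine the flags above is to export the BOM only after the embedded blob verifies. This is a sketch, not a prescribed workflow; export_sbom is a hypothetical helper name.

```shell
# Hypothetical helper: write the CycloneDX BOM to $1, but only when
# `mcptest sbom --verify` confirms the embedded blob is intact.
export_sbom() {
  if mcptest sbom --verify; then
    mcptest sbom --format cyclonedx --out "$1"
  else
    echo "sbom: embedded BOM failed verification; not exporting" >&2
    return 2
  fi
}

# Usage: export_sbom mcptest.cdx.json
```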

`evidence`

mcptest [GLOBAL_OPTIONS] evidence <REPORT> [--security FILE] [--reproducible] [--out PATH] [--sign]
mcptest [GLOBAL_OPTIONS] evidence verify <EVIDENCE> [--max-age DURATION] [--signature FILE] [--require-signed]

Description. Aggregate a mcptest run --format json report into a portable evidence artifact (server identity, spec version, corpus hash, source provenance, grades, reproducibility), or verify one. --sign reuses the release Sigstore cosign path to attach a detached signature. See portable run evidence.

Arguments (emit).

Argument	Type	Required	Description
`<REPORT>`	filesystem path	yes (unless a subcommand is used)	A serialized `mcptest run --format json` report. Must carry run metadata.
`--security <FILE>`	filesystem path	optional	A `mcptest security --format json` report whose severity counts fold into the grades.
`--reproducible`	flag	optional	Mark the run byte-reproducible (the `sbom --verify` / `SOURCE_DATE_EPOCH` parity signal).
`--out <PATH>`	filesystem path	optional	Write the artifact here instead of stdout. Required with `--sign`.
`--sign`	flag	optional	Sign the artifact with `cosign sign-blob` (keyless, GitHub OIDC), writing `<out>.sig` and `<out>.cert`. Requires `cosign` on PATH.

Arguments (verify).

Argument	Type	Required	Description
`<EVIDENCE>`	filesystem path	yes	The `evidence.json` artifact to verify.
`--max-age <DURATION>`	duration (`720h`, `30m`)	optional	Reject evidence whose `generated_at` is older than this.
`--signature <FILE>`	filesystem path	optional	Detached signature; defaults to `<evidence>.sig` when present.
`--require-signed`	flag	optional	Reject the artifact when it is unsigned.

Examples.

# Emit an artifact from a run, folding in a security scan.
mcptest evidence run.json --security security.json --reproducible --out evidence.json

# Sign it (needs cosign on PATH).
mcptest evidence run.json --out evidence.json --sign

# Verify: reject stale (>30d), forked, or unsigned evidence.
mcptest evidence verify evidence.json --max-age 720h --require-signed

Exit codes. Emit: 0 on success, 2 when the report cannot be read or carries no metadata (or --sign cannot run). Verify: 0 accepted, 1 rejected (reasons printed), 2 when the artifact cannot be read.

Status. Working. Cryptographic Sigstore verification (Rekor inclusion, certificate identity) is cosign verify-blob's job; evidence verify owns the freshness, commit-ancestry, and signature-presence policy.
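
Emit and verify are often run back to back in CI. The sketch below chains them, assuming a hypothetical helper name, emit_and_verify_evidence, and unsigned evidence (so --require-signed is deliberately left out).

```shell
# Hypothetical pipeline: emit an evidence artifact, then verify its freshness.
# $1 = run report (JSON), $2 = security report (JSON), $3 = artifact path.
emit_and_verify_evidence() {
  # Stop early and propagate the exit code if the artifact cannot be emitted.
  mcptest evidence "$1" --security "$2" --out "$3" || return $?
  # 720h matches the 30-day window used in the examples above.
  mcptest evidence verify "$3" --max-age 720h
}

# Usage: emit_and_verify_evidence run.json security.json evidence.json
```

Add --sign to the emit step and --require-signed to the verify step when cosign is available on PATH.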

`ledger`

mcptest [GLOBAL_OPTIONS] ledger emit <ENVELOPE> [--session-id ID] [--output PATH]
mcptest [GLOBAL_OPTIONS] ledger diff <BASELINE> <ACTUAL> [--max-diff N]

Description. Turn a saved agent run envelope into a session-ledger NDJSON file, or diff an actual ledger against a baseline trajectory. The ledger is the append-only, structured record of the tool calls a run made: one header record, then one tool_call record per call, in call order. See session ledger for the schema and field reference.

Arguments (emit).

Argument	Type	Required	Description
`<ENVELOPE>`	filesystem path	yes	A JSON file holding an agent run envelope with a `tool_calls` array (each entry has `name`, `server`, `args`). This is the shape a single agent test produces in `mcptest run --reporter json`.
`--session-id <ID>`	string	optional	Session id stamped on every record.
`--output <PATH>`	filesystem path	optional	Write the ledger here instead of stdout.

Arguments (diff).

Argument	Type	Required	Description
`<BASELINE>`	filesystem path	yes	The recorded baseline ledger NDJSON.
`<ACTUAL>`	filesystem path	yes	The fresh ledger to compare.
`--max-diff <N>`	integer	optional (default `0`)	Maximum tolerated divergent tool calls. The command exits non-zero once divergences exceed this; `0` requires an exact match.

Examples.

# Record a baseline trajectory from a saved envelope.
mcptest ledger emit envelope.json --session-id run-42 --output baseline.ndjson

# Gate a fresh run against the baseline in CI (exact match).
mcptest ledger diff baseline.ndjson actual.ndjson --max-diff 0

The diff compares tool calls position by position per agent_id: a different tool at a hop is reported as a remove plus an add, and a matching tool with different params is reported as a param change. Sample output when get_weather is replaced by delete at hop 1:

  - removed  hop 1: get_weather
  + added    hop 1: delete
ledger diff: 2 divergence(s) exceed --max-diff 0

Exit codes. 0 clean (or divergences within --max-diff), 1 divergences exceed --max-diff, 2 when an input cannot be read.

Status. Working. The schema is owned here; see session ledger for the open-core boundary.
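
Since a ledger is one header record followed by one tool_call record per call, the number of calls in a file is its record count minus the header. The sketch below relies only on that documented layout; count_tool_calls is a hypothetical helper name, and no field names are assumed.

```shell
# Hypothetical helper: count tool_call records in a ledger NDJSON file.
# Line 1 is the header record; every later non-empty line is one tool call.
count_tool_calls() {
  awk 'NR > 1 && NF { n++ } END { print n + 0 }' "$1"
}

# Usage: count_tool_calls baseline.ndjson
```

A quick count like this can help pick a sensible --max-diff tolerance relative to the length of the baseline trajectory.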

`web-bot-auth`

mcptest [GLOBAL_OPTIONS] web-bot-auth directory [--key PATH | --key-env VAR] [--algorithm ALG] [--agent URL]

Description. Emit the .well-known/http-message-signatures-directory JWK Set for a Web Bot Auth signing key. Only the public key is printed; the private key is never written to the output. See Web Bot Auth for the full signing and verification story.

Arguments (directory).

Argument	Type	Required	Description
`--key <PATH>`	filesystem path	one of `--key`/`--key-env`	PKCS#8 PEM file holding the private signing key. Only the derived public key is emitted.
`--key-env <VAR>`	env var name	one of `--key`/`--key-env`	Env var holding the PKCS#8 PEM private key, so the key never appears on the command line.
`--algorithm <ALG>`	`ed25519` or `rsa-pss`	optional (default `ed25519`)	Signature algorithm. Must match the key type.
`--agent <URL>`	URL	optional	`Signature-Agent` URL identifying the bot. Recorded in the validated config; it does not appear in the JWK Set itself.

Examples.

# Emit the public JWK Set for an Ed25519 key.
mcptest web-bot-auth directory --key bot.ed25519

# Read the key from an env var instead of a file.
mcptest web-bot-auth directory --key-env BOT_SIGNING_KEY

Exit codes. 0 on success, non-zero when the key is missing or malformed.

Status. Working.

Exit codes

mcptest uses a small, stable set of exit codes so CI scripts can react without parsing stdout. Every subcommand documents which codes it can return; this table is the central reference.

Code	Meaning	Source
`0`	Success. The command did what it was asked to do.	All subcommands.
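`1`	Test failures, a gate that fired, or a malformed input artifact.	`run`, `report` (bad input), `diff` (breaking change with `--fail-on-breaking true`), `eval` (failing verdict), `compliance` (regression vs baseline), `model-compat` (FAIL), `security` and `security import` (finding at or above `--fail-on`), `evidence verify` (rejected), `ledger diff` (divergences exceed `--max-diff`).
`2`	Configuration error, invalid arguments, or an unreadable input.	`validate`, `init` (write conflict), `report` (collector rejection), `run` (config load failed), `security`, `security import`, `evidence`, and `ledger` (input cannot be read), `sbom --verify` (hash mismatch). clap also returns `2` for unknown flags.
`3`	`--wait-for-ready` budget expired before a URL server accepted connections.	`run`, `doctor`.
`5`	Cost cap exceeded, or `run --update-snapshots` refused under `CI=true`.	`eval`, `run`.
`6`	Coverage below threshold, a model-compat DRIFT, or generated-stub drift.	`coverage` and `run --coverage-threshold`, `model-compat` (DRIFT), `generate stubs --check` (drift).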
`7`	No tests selected. The suite is empty, or `--filter`, `--shard`, or `--last-failed` matched nothing, and `--pass-with-no-tests` was not passed.	`run`.

Codes outside this set are reserved. An unexpected non-zero exit is most often clap returning 2 for a parse error rather than a reserved code.

Note: a future doctor --lint-descriptions quality lint will land its own exit code when that feature ships; it is not wired in the v1.0 binary today.
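
A CI script can turn the table above into readable log lines with a small lookup. This is an illustrative sketch; describe_mcptest_exit is a hypothetical helper name, and the short labels paraphrase the Meaning column.

```shell
# Hypothetical helper: map an mcptest exit code to a short label.
describe_mcptest_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "test failures or malformed input" ;;
    2) echo "configuration error or invalid arguments" ;;
    3) echo "wait-for-ready budget expired" ;;
    5) echo "cost cap exceeded or snapshot update refused in CI" ;;
    6) echo "coverage below threshold or model-compat drift" ;;
    7) echo "no tests selected" ;;
    *) echo "reserved or unknown exit code $1" ;;
  esac
}

# Usage:
#   rc=0; mcptest run || rc=$?
#   describe_mcptest_exit "$rc"
```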

Cross-references

getting-started.md: five-minute walkthrough that installs the binary, scaffolds a project, and runs the first test.
yaml-reference.md: every field in the YAML test format, the schema, and worked examples.
troubleshooting.md: common failure modes, what each exit code means in practice, and how to diagnose a stuck run.
crates/mcptest/src/cli/: the source of truth for every flag on this page (cli/mod.rs for the command tree, cli/args/ for each subcommand's flags). If this doc disagrees with the source, the source wins; please file a ticket against the docs.
schemas/v1.json: the JSON Schema emitted by mcptest schema.
schemas/wire/v0.json: the upload envelope shape consumed by mcptest report --format upload.
