YAML test format reference (v1)

A complete walk-through of every field mcptest reads from a test YAML file. The JSON Schema at schemas/v1.json is the source of truth. This page documents the v1 surface the schema actually validates today and flags the fields that are in flight or deferred so you can plan around them without guessing.

Overview

An mcptest configuration is a single YAML file (or a small tree of files joined via imports) that describes:

  1. Which MCP servers to talk to.
  2. Which tool calls, compliance checks, and evaluations to run.
  3. How to judge the responses.

Every file should start with the YAML language server directive so editors that understand JSON Schema can autocomplete fields, surface inline help, and flag typos before you run the test:

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json

The CLI runs the same schema at load time. mcptest validate <file> loads the YAML, applies schemas/v1.json via the jsonschema crate, and prints errors with JSON pointers into the offending fields. The loader rejects unknown keys at the top level and inside every nested object that uses additionalProperties: false in the schema, so a typo in varables: fails the run instead of silently doing nothing.

A minimal file looks like this:

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json

servers:
  local:
    command: ["./target/debug/my-mcp-server"]

tools:
  - name: "lists tools without error"
    server: local
    tool: "list_directory"
    args:
      path: "/tmp"
    expect:
      - target: "result.content"
        matcher:
          schema:
            type: array
            minItems: 1

The rest of this page walks through each block from the top down. Every field entry lists its type, whether it is required, its default, a short snippet, and links to related sections.

Top-level keys

| Key | Required | Purpose |
| --- | --- | --- |
| servers | yes | Named MCP servers under test. See servers block. |
| imports | no | Other YAML files to merge in. See imports block. |
| variables | no | Author-defined variables for ${name} interpolation. See variables block. |
| tools | no | Array of tool-call tests. See tools block. |
| resources | no | Array of resource-read tests. See resources block. |
| prompts | no | Array of prompt-get tests. See prompts block. |
| agents | no | Agent end-to-end tests against one or more models. See agents block. |
| faults | no | Named fault injections an agent test references via inject:. See faults block. |
| providers | no | Custom OpenAI-compatible provider definitions. See providers block. |
| budget | no | Per-test and per-suite spend caps for agent runs. See budget block. |
| compliance | no | Protocol-level checks against the server. See compliance block. |
| evals | no | Rubric or model-graded evaluations. See evals block. |
| rubrics | no | Reusable named rubrics referenced from evals via rubric: { ref: <name> }. See evals block. |
| model_compatibility | no | Metadata list of model identifiers the suite targets. See model_compatibility block. |
| performance | no | Per-test and suite-wide latency budgets. See performance block. |
| target_versions | no | Run the suite against more than one MCP protocol revision. See target_versions block. |

Anything else at the top level fails validation. If you need a custom field for a future feature, file an issue before adding it.

target_versions block

Type: array of protocol-revision strings. Optional.

A suite that lists target_versions: declares that it wants to run against every listed MCP revision in turn, not just the runner's default. A server in a deprecation window can be validated against both the old and the new contracts in a single command, and the reporter aggregates per-version pass/fail.

target_versions:
  - "2025-11-25"
  - "2026-07-28"

servers:
  fs:
    command: ["./fs-server"]

The accepted values are the wire strings listed in docs/stateless-transport.md: 2024-11-05, 2025-03-26, 2025-06-18, 2025-11-25, 2026-03-26, 2026-07-28. An unknown value is a load-time error so a typo never silently selects a session-based fallback. Duplicates are de-duped while preserving declaration order so the reporter prints per-version results in the order the suite listed them.

CLI override. mcptest run --target-version <V> selects a single entry from the suite's list, so a CI matrix can keep one job per protocol revision without sharding the suite. The override must parse via ProtocolVersion::from_wire and, when the suite declares a list, must be one of its entries; mismatched overrides fail with a pointer at the declared list. When the suite declares no list, the override still works as a one-off "run against this revision only" knob.

Selector library. The pure mcptest_core::target_versions::effective(suite_versions, cli_override) function returns the list the runner walks. The runner integration that calls it (per-version fan-out + per-version report aggregation) is the next focused commit on this work; the selector ships today so the CLI flag and the integration land in isolation.
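
The selection rules above can be sketched in a few lines. This is a Python rendition written for illustration only, not the Rust function the page names; the constant `KNOWN_VERSIONS`, the exception type, and the choice to return an empty list when nothing is declared are assumptions.

```python
from typing import List, Optional

# Wire strings this page lists as accepted values.
KNOWN_VERSIONS = (
    "2024-11-05", "2025-03-26", "2025-06-18",
    "2025-11-25", "2026-03-26", "2026-07-28",
)


def effective(suite_versions: List[str], cli_override: Optional[str]) -> List[str]:
    """Return the revisions a runner would walk (illustrative sketch)."""
    declared: List[str] = []
    for version in suite_versions:
        if version not in KNOWN_VERSIONS:
            raise ValueError(f"unknown target version: {version!r}")
        if version not in declared:  # de-dupe, keeping declaration order
            declared.append(version)

    if cli_override is None:
        # Assumption: an empty list means "use the runner's default revision".
        return declared
    if cli_override not in KNOWN_VERSIONS:
        raise ValueError(f"unknown --target-version: {cli_override!r}")
    if declared and cli_override not in declared:
        raise ValueError(
            f"--target-version {cli_override!r} is not in the declared list {declared}"
        )
    return [cli_override]
```

Keeping the selector a pure function of its two inputs makes the override rules easy to unit test without starting a server.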

servers block

Type: object, required, at least one entry.

Each key under servers is the name a test block refers to via its server: field. The value is one of three shapes: a subprocess specification (command:), a URL specification (url:), or a cassette specification (cassette:) that replays a recording instead of reaching a live server. Exactly one of the three must be present; the schema enforces this with oneOf.
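
As a rough illustration of the "exactly one of three" rule, the check below classifies a server entry by which shape key it carries. It is a hypothetical Python helper, not the schema's oneOf machinery; `server_kind` and its error message are invented for this sketch.

```python
SHAPE_KEYS = ("command", "url", "cassette")


def server_kind(spec: dict) -> str:
    """Return which of the three shapes a server entry uses (hypothetical helper)."""
    present = [key for key in SHAPE_KEYS if key in spec]
    if len(present) != 1:
        raise ValueError(f"expected exactly one of {SHAPE_KEYS}, found {present}")
    return present[0]
```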

Subprocess (stdio) servers

Use a subprocess server when the MCP server is a local binary, an npx package, or anything that speaks MCP over stdio.

servers:
  filesystem:
    command: ["npx", "-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
    env:
      LOG_LEVEL: "info"

Fields:

Snippet for a Rust binary built in the workspace:

servers:
  local:
    command: ["./target/debug/my-mcp-server", "--config", "./fixtures/test.toml"]
    env:
      RUST_LOG: "debug"

See also: Variable interpolation, tools block.

CLI server-target overrides

Four global CLI flags rewrite the server: block at runtime so the same YAML can target a freshly-deployed preview or CI environment whose URL or command is not knowable at authoring time. The flags live on the global parser, so they go before the subcommand: mcptest --server-url https://preview run.

| Flag | Effect |
| --- | --- |
| --server-url <URL> | Replace server.url. If the YAML had a command:, it is removed. Mutually exclusive with --server-command. |
| --server-command <CMD> | Replace server.command. <CMD> is split with POSIX shell rules (shell-words), so --server-command "./dev --debug" parses to ["./dev", "--debug"]. Mutually exclusive with --server-url. |
| --server-auth-bearer-env <NAME> | Set server.auth.bearer_token_env for URL targets. |
| --server-config <PATH> | Load a YAML file containing a full server: block and use it in place of the in-suite block. |

Precedence (lowest to highest):

  1. The YAML file's own server: block.
  2. --server-config (full-block replacement).
  3. --server-url, --server-command, --server-auth-bearer-env (single-field overrides). A single-field flag wins over the same field in --server-config.

When any override flag is set, the runner prints a one-line banner naming the flags before it runs so the operator can spot a misconfigured CI job without re-reading the YAML.
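
The precedence order above can be pictured as a small merge function. This is a minimal Python sketch under stated assumptions, not the runner's code: `apply_overrides` and its parameter names are hypothetical, `shlex.split` stands in for the shell-words splitting the page mentions, and dropping `url` when a command override is given is an assumption inferred from the one-shape-per-server rule.

```python
import shlex
from typing import Optional


def apply_overrides(
    yaml_server: dict,
    config_server: Optional[dict] = None,  # parsed --server-config file
    url: Optional[str] = None,             # --server-url
    command: Optional[str] = None,         # --server-command
    bearer_env: Optional[str] = None,      # --server-auth-bearer-env
) -> dict:
    """Resolve the effective server block (illustrative sketch)."""
    if url is not None and command is not None:
        raise ValueError("--server-url and --server-command are mutually exclusive")

    # Levels 1 and 2: --server-config replaces the YAML block wholesale.
    server = dict(config_server if config_server is not None else yaml_server)

    # Level 3: single-field flags win over whatever the block held.
    if url is not None:
        server["url"] = url
        server.pop("command", None)  # stated on this page
    if command is not None:
        server["command"] = shlex.split(command)  # POSIX shell splitting
        server.pop("url", None)  # assumption: mirror of the --server-url rule
    if bearer_env is not None:
        auth = dict(server.get("auth", {}))
        auth["bearer_token_env"] = bearer_env
        server["auth"] = auth
    return server
```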

URL (HTTP or SSE) servers

Use a URL server when the MCP server is already running, lives in a container, or is reachable over the network.

servers:
  remote_api:
    url: "https://mcp.example.com/v1"
    auth:
      bearer_token_env: "MCPTEST_REMOTE_API_TOKEN"
    wait_for_ready: "https://mcp.example.com/healthz"

Fields:

The headers and mtls keys are reserved on the auth object for a future release. They are not validated against any shape in v1, and the loader prints a warning if it sees them. See the deferred notes in auth block.

Container-managed servers are only briefly documented here. Full container support (lifecycle, networking, healthchecks) is planned for a future release. Until then, run the container yourself and point a URL server at it.

See also: auth block, Variable interpolation.

Cassette (replay) servers

Use a cassette server to replay a recorded MCP server instead of reaching a live one. The run never touches the network: every request is matched against the recorded exchanges and the recorded response is returned. This is the offline, deterministic, key-free path for CI.

servers:
  recorded:
    cassette: ./cassettes/issues-server.json

Fields:

cassette: is mutually exclusive with command: and url:; setting more than one is rejected at load time. See Cassettes for the file format and how matching works.

auth block

Authentication strategy for URL servers. Exactly one of bearer_token_env or oauth must be present in v1.

Bearer token from env

servers:
  remote_api:
    url: "https://mcp.example.com/v1"
    auth:
      bearer_token_env: "MCPTEST_REMOTE_API_TOKEN"

Fields:

Note the distinction between bearer_token_env: NAME and ${NAME}. The former tells mcptest "read this env var into the Authorization header." The latter is generic string interpolation usable in any string field. Use bearer_token_env for auth credentials; the loader applies a redaction pass to those values in reporter output. Use ${NAME} for everything else. This is covered again in Variable interpolation.

OAuth 2.0 client credentials

servers:
  remote_api:
    url: "https://mcp.example.com/v1"
    auth:
      oauth:
        client_id_env: "MCPTEST_OAUTH_CLIENT_ID"
        authorization_url: "https://auth.example.com/oauth/authorize"
        token_url: "https://auth.example.com/oauth/token"
        scopes:
          - "mcp:read"
          - "mcp:invoke"

Fields under oauth:

Deferred auth fields

See also: Variable interpolation, tools block.

Custom headers and HTTP transport

URL servers accept two optional blocks that configure the HTTP transport without touching auth:. The server.headers block sends custom headers on every request, and the server.http block tunes timeouts, redirect handling, and basic TLS knobs. mTLS (client certificates) is deferred to a future release; the model carries verify on/off, CA bundle, and minimum protocol version only.

server.headers

Type: object, optional, default {}.

Each entry maps a header name to a value. The value is either a literal string (with ${VAR} interpolation against variables) or an object {env: NAME} that the runner reads from the environment at connect time. Pick the env form for secret-flavored headers so they never land in the YAML file.

servers:
  saas_api:
    url: "https://api.example.com/mcp"
    headers:
      X-Tenant: "acme"
      X-API-Key:
        env: ACME_TENANT_API_KEY
      X-Trace-Id: "${request_id}"

Fields per entry:

Authorization and Proxy-Authorization are rejected here. Use the auth: block for bearer tokens and OAuth; see auth block. Header names must consist of valid RFC 7230 token characters (letters, digits, and the punctuation set ! # $ % & ' * + - . ^ _ ` | ~). Names with spaces or colons fail validation.
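
A header-name check along these lines can be sketched with one regular expression. The code below is an illustrative Python version, not the loader's validator; `check_header_name` is a hypothetical name, and treating the two reserved names case-insensitively is an assumption.

```python
import re

# RFC 7230 "token": letters, digits, and ! # $ % & ' * + - . ^ _ ` | ~
_TOKEN_RE = re.compile(r"[A-Za-z0-9!#$%&'*+\-.^_`|~]+")
# Assumption: reserved names are compared case-insensitively.
_RESERVED = {"authorization", "proxy-authorization"}


def check_header_name(name: str) -> None:
    """Raise ValueError for names this page says are rejected (sketch)."""
    if _TOKEN_RE.fullmatch(name) is None:
        raise ValueError(f"invalid header name: {name!r}")
    if name.lower() in _RESERVED:
        raise ValueError(f"{name} belongs in the auth: block, not headers:")
```

Note that only the exact reserved names are refused, so a name such as Proxy-Authorization-IAP still passes.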

server.http

Type: object, optional, default {}.

Controls HTTP transport behavior for URL servers. Stdio servers ignore the block. Duration fields take a number with a suffix: 30s, 500ms, 1m, 1h. Bare numbers without a unit are rejected so a future reader can tell what the value means.

servers:
  saas_api:
    url: "https://api.example.com/mcp"
    http:
      timeout: 30s
      connect_timeout: 5s
      max_redirects: 5
      user_agent_override: true
      tls:
        insecure_skip_verify: false
        ca_bundle_path: "/etc/ssl/internal-ca.pem"
        min_version: "1.2"
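
The duration format described above (a number plus a required unit suffix) could be parsed roughly as follows. This is a Python sketch for illustration; `parse_duration_ms` is a hypothetical helper, and accepting only whole numbers is an assumption, since the page does not say whether fractional values are allowed.

```python
import re

_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000}
# Assumption: integers only. The unit suffix is mandatory.
_DURATION_RE = re.compile(r"(\d+)(ms|s|m|h)")


def parse_duration_ms(text: str) -> int:
    """Convert a string such as '30s' or '500ms' to milliseconds (sketch)."""
    match = _DURATION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"expected a number with a unit (ms, s, m, h), got {text!r}")
    return int(match.group(1)) * _UNIT_MS[match.group(2)]
```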

Fields:

Fields under tls:

CLI flags for headers and HTTP transport

Six global CLI flags map to the same surface so the same YAML can target a preview environment without editing the file. --header and --header-env are repeatable.

| Flag | Effect |
| --- | --- |
| --header NAME=VALUE | Append a literal header. Repeatable. Rejects Authorization and Proxy-Authorization. |
| --header-env NAME=VAR | Append an env-backed header. Repeatable. Reads VAR from the environment at connect time. |
| --insecure-skip-verify | Disable TLS verification (loud WARNING banner). |
| --ca-bundle PATH | Path to a PEM-encoded CA bundle. |
| --http-timeout SECONDS | Override server.http.timeout. |
| --connect-timeout SECONDS | Override server.http.connect_timeout. |

Precedence (lowest to highest):

  1. server.headers and server.http from the YAML file.
  2. --header, --header-env, --ca-bundle, --http-timeout, --connect-timeout, --insecure-skip-verify. CLI flags overwrite YAML values per header name (case-insensitive).
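
The per-name, case-insensitive overwrite can be illustrated with a short merge helper. This is a Python sketch, not the runner's code; `merge_headers` is a hypothetical name, and keeping the CLI's spelling of the header name when it wins is an assumption.

```python
def merge_headers(yaml_headers: dict, cli_headers: dict) -> dict:
    """Merge headers so CLI entries replace YAML entries of the same name (sketch)."""
    merged: dict = {}
    for source in (yaml_headers, cli_headers):  # lowest precedence first
        for name, value in source.items():
            # Drop any earlier spelling of the same header, ignoring case.
            for existing in [key for key in merged if key.lower() == name.lower()]:
                del merged[existing]
            merged[name] = value  # assumption: the later spelling is kept
    return merged
```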

Worked examples

Cloudflare Access:

servers:
  protected:
    url: "https://mcp.internal.example.com/v1"
    headers:
      CF-Access-Client-Id:
        env: CF_ACCESS_CLIENT_ID
      CF-Access-Client-Secret:
        env: CF_ACCESS_CLIENT_SECRET

Google Cloud IAP:

servers:
  gcp:
    url: "https://iap.example.com/mcp"
    headers:
      Proxy-Authorization-IAP:
        env: GCP_IAP_TOKEN
    http:
      timeout: 60s

AWS API Gateway with an API key:

servers:
  awsg:
    url: "https://abc123.execute-api.us-east-1.amazonaws.com/prod/mcp"
    headers:
      x-api-key:
        env: AWSG_API_KEY

Multi-tenant SaaS with a tenant header:

servers:
  saas:
    url: "https://api.example.com/mcp"
    headers:
      X-Tenant: "acme"
      X-Plan: "enterprise"

Distributed tracing:

servers:
  traced:
    url: "https://api.example.com/mcp"
    headers:
      X-Trace-Id: "${run_id}"
      traceparent: "${otel_traceparent}"

Internal CA bundle (corporate intranet endpoint):

servers:
  internal:
    url: "https://mcp.corp.example.com/v1"
    http:
      tls:
        ca_bundle_path: "/etc/ssl/corp-internal-ca.pem"
        min_version: "1.3"

mTLS (client certificates) is deferred to a future release; this section will gain an mtls: block alongside tls: when that work lands.

variables block

Type: object, optional, default {}.

variables holds author-defined values usable inside any string field via ${name} interpolation. Each entry is either a literal (value:) or an environment reference (from_env:), never both.

variables:
  fixture_path:
    value: "/tmp/mcptest-fixture.txt"

  account_id:
    from_env: "MCPTEST_ACCOUNT_ID"
    default: "acct_demo"

Fields per entry:

Resolution precedence is documented at mcptest.sh/docs/secrets-and-variables. The short version, in order from highest to lowest precedence:

  1. Process environment variables (export VAR=... or whatever your shell or CI runner set up).
  2. Values written to a loaded dotenv file (.env next to the test file, or a file passed via --env-file).
  3. The default: field on a from_env variable.
  4. The value: field on a literal variable.

Variables resolve once when the configuration loads. References that fail to resolve (env var unset, no default) raise a structured error before any test runs.

variables:
  api_base:
    value: "https://staging.example.com"

  auth_token:
    from_env: "MCPTEST_TOKEN"
    default: "dev-only-token"

tools:
  - name: "fetches a resource"
    server: remote_api
    tool: "fetch"
    args:
      url: "${api_base}/resources/42"
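
To make the lookup order concrete, here is a small Python sketch of resolving one variables entry and then expanding ${name} references. It is not the loader's implementation: the function names are hypothetical, the two mappings stand in for the process environment and a loaded dotenv file, and it assumes a literal value: entry is never overridden by the environment, which this page does not spell out.

```python
import re
from typing import Mapping


def resolve_variable(name: str, entry: Mapping, process_env: Mapping, dotenv: Mapping) -> str:
    """Resolve one variables entry (illustrative sketch)."""
    if "value" in entry and "from_env" in entry:
        raise ValueError(f"variable {name!r} sets both value and from_env")
    if "value" in entry:
        return entry["value"]  # assumption: literals are used as written
    env_name = entry["from_env"]
    for source in (process_env, dotenv):  # process env beats dotenv
        if env_name in source:
            return source[env_name]
    if "default" in entry:
        return entry["default"]
    raise KeyError(f"variable {name!r}: {env_name} is unset and has no default")


_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate(text: str, resolved: Mapping) -> str:
    """Expand ${name} references; an unknown name raises KeyError."""
    return _REF.sub(lambda match: resolved[match.group(1)], text)
```

Resolving every entry up front and failing on the first unresolved reference matches the behaviour described above, where errors surface before any test runs.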

See also: Variable interpolation, servers block.

tools block

Type: array of objects, optional, default [].

Each entry is a tool-call test: invoke a single tool on a named server and run assertions against the result.

tools:
  - name: "lists tools without error"
    server: filesystem
    tool: "list_directory"
    args:
      path: "/tmp"
    expect:
      - target: "result.content"
        matcher:
          schema:
            type: array
            minItems: 1
    timeout_ms: 5000

Fields per entry:

Parametric inputs (a data: array that runs the same test once per row) are deferred. The intended shape is one entry per row, with each row substituted into the args block via ${data.field} interpolation. Until that lands, duplicate the test or use a small shell loop.

A worked example using both literal and env-backed variables:

variables:
  greeting:
    value: "hello"
  who:
    from_env: "MCPTEST_GREET_WHO"
    default: "world"

tools:
  - name: "echoes a greeting"
    server: local
    tool: "echo"
    args:
      message: "${greeting}, ${who}"
    expect:
      - target: "result.content[0].text"
        matcher:
          exact: "hello, world"
        message: "echo should round-trip the rendered message"

See also: expect block, variables block, Test styles.

tags (optional, on every test type)

Type: array of strings. Optional. Available on tool tests, resource tests, prompt tests, and agent tests.

tools:
  - name: search returns at least one result
    server: prod-api
    tool: search
    args:
      query: "weather"
    tags: ["smoke", "search"]
    expect:
      - target: "result.content"
        matcher:
          contains: "weather"

mcptest run --tag smoke keeps tests whose tag list contains smoke. mcptest run --skip-tag slow drops tests whose tag list contains slow. Multiple --tag flags are OR'd; --skip-tag wins when a test matches both.

For agent tests the tag applies to every (agent, model) row the matrix expands to, so --tag smoke keeps or drops every model under that agent as a unit. See cli-reference.md for the full flag matrix.
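
The selection rule reads as a two-step predicate, sketched below in Python. `is_selected` is a hypothetical helper rather than part of mcptest, and keeping every test when no --tag flag is passed is an assumption.

```python
from typing import Iterable, Set


def is_selected(test_tags: Iterable[str], tags: Set[str], skip_tags: Set[str]) -> bool:
    """Decide whether a test survives --tag / --skip-tag filtering (sketch)."""
    test_tags = set(test_tags)
    if test_tags & skip_tags:  # --skip-tag wins when both match
        return False
    if not tags:  # assumption: no --tag flags keeps everything else
        return True
    return bool(test_tags & tags)  # multiple --tag flags are OR'd
```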

transform (optional, on tool tests)

Type: object with optional request and response command strings. Optional. Tool tests only.

A transform is a subprocess that rewrites the outbound request before it is sent, the response before assertions run, or both. The command reads one JSON value on stdin and writes the replacement value on stdout, so plain jq, node, or python work without any framing protocol.

tools:
  - name: search with normalized response
    server: prod-api
    tool: search
    args:
      query: "weather"
    transform:
      request: jq '.arguments.query |= ascii_downcase'
      response: ./transforms/strip-ids.sh
    expect:
      - target: result.content[0].text
        matcher:
          icontains: weather

The request command receives { "name": ..., "arguments": ... }; the response command receives { "result": ... }. A non-zero exit, a timeout, or unparseable stdout fails the test. You can set a default transform under defaultTest: and override it per test. See transforms.md for the full contract, the environment context, and worked examples in jq, Node, and Python.
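
For comparison with the jq one-liner, a request transform written in Python might look like the sketch below. It covers only the stdin/stdout contract stated on this page; the function names are hypothetical, and Python's lower() is a close but not identical stand-in for jq's ASCII-only ascii_downcase.

```python
import json
import sys


def transform_request(payload: dict) -> dict:
    """Lower-case arguments.query, leaving every other field untouched."""
    arguments = dict(payload.get("arguments", {}))
    if isinstance(arguments.get("query"), str):
        arguments["query"] = arguments["query"].lower()
    return {**payload, "arguments": arguments}


def run(stdin, stdout) -> None:
    """Read one JSON value, write the replacement JSON value."""
    payload = json.load(stdin)
    json.dump(transform_request(payload), stdout)


# When saved as a standalone script, finish with: run(sys.stdin, sys.stdout)
```

Taking the streams as parameters keeps the transform testable without spawning a subprocess.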

resources block

Type: array of objects, optional, default [].

Each entry is a resource-read test: read a single resource by URI from a named server (resources/read) and run assertions against the result envelope.

resources:
  - name: "readme resource is text"
    server: filesystem
    uri: "file:///workspace/README.md"
    expect:
      - target: "result.contents[0].mimeType"
        matcher:
          icontains: "text"

Fields per entry:

prompts block

Type: array of objects, optional, default [].

Each entry is a prompt-get test: fetch a single prompt by name from a named server (prompts/get) and run assertions against the result envelope.

prompts:
  - name: "bug-triage prompt renders"
    server: filesystem
    prompt: bug_triage
    args:
      severity: high
    expect:
      - target: "result.messages[0].content.text"
        matcher:
          icontains: "triage"

Fields per entry:

Both resource-read and prompt-get tests run against any MCP server, including the built-in mcptest mock, which serves resources/* and prompts/* from its manifest.

agents block

Type: array, optional. Each entry runs a real LLM against one or more MCP servers and asserts against the resulting conversation trace. Background and design rationale live in docs/models.md and docs/concepts.md.

agents:
  - name: weather query routes to get_weather
    # Either `model:` (singleton) or `models:` (matrix).
    models:
      - claude-sonnet-4-5                    # auto-detect family
      - { provider: openrouter, id: openai/gpt-4o }   # named provider
    servers: [weather]                                # one or more MCP servers
    prompt: What is the weather in Sacramento?
    system_prompt: |                                   # optional
      You are a weather assistant.
    max_turns: 10                                      # optional, default 10
    max_tokens: 1024                                   # optional, default 1024 (per call)
    token_budget: 50000                                # optional, cumulative tokens across the run
    max_tool_calls: 20                                 # optional, total tool calls across the run
    time_budget: 30s                                   # optional, wall-clock deadline for the run
    expect:
      - target: tool_calls[0].name
        matcher: { exact: get_weather }
      - target: tool_names                             # the whole trajectory, in order
        matcher: { contains-all: [get_weather] }
      - target: redundant_tool_calls                   # no repeated (name, args) calls
        matcher: { exact: 0 }
      - target: conversation.tokens.total
        matcher: { regex: "^[0-9]+$" }

Per-entry fields:

| Field | Required | Purpose |
| --- | --- | --- |
| name | yes | Test name shown in the reporter. |
| model or models | yes | Exactly one. model is sugar for models: [<value>]. Each entry under models: is either a bare string (auto-detected family) or { provider: <name>, id: <model> } referencing a providers: entry. |
| servers | yes | At least one server key from the top-level servers: map. The driver exposes the merged tool catalog from every named server to the model. |
| prompt | yes* | User prompt that kicks off the conversation. *Optional when a turns: block supplies the user side (see Multi-turn conversations). |
| turns | no | Drive a multi-turn conversation instead of a single prompt: a scripted list of { user: ... } turns, or a { simulate: ... } block where an LLM plays the user. See Multi-turn conversations. |
| system_prompt | no | Optional system prompt. |
| max_turns | no | Hard cap on tool-use iterations. Defaults to 10. |
| max_tokens | no | Per-call max_tokens for the model. Defaults to 1024. |
| token_budget | no | Cumulative token budget across the whole run. Distinct from per-call max_tokens: this sums every turn's usage and stops the run once the total crosses the cap. Omit for an unbounded run. |
| max_tool_calls | no | Cap on total tool invocations across the run. The run stops once the count crosses the cap. Omit for an unbounded run. |
| time_budget | no | Wall-clock deadline for the run as a human duration string (30s, 2m, 1h, 500ms). The run stops at the next loop boundary once the deadline passes. Omit for an unbounded run. |
| inject | no | Names of top-level faults to inject into this run. A call matching an injected fault is synthesized as an unresponsive timeout. |
| recovery | no | Recovery gate scored against the injected fault: max_detection_ms (required), max_recovery_ms (optional), require_clean_timeout (optional). See faults block. |
| expect | no | Assertions evaluated against the trace envelope (see below). |

The run-wide caps (token_budget, max_tool_calls, time_budget) all default to off, so a suite that omits them runs exactly as before. When a cap trips, the run stops with a typed error that names the cap and its limit, and the trace records the stopping cap on conversation.stop_reason (one of completed, turn_budget, token_budget, tool_call_budget, time_budget).

Matcher target grammar for agent runs:

| Target | What it resolves to |
| --- | --- |
| tool_calls[i].name | Bare tool name the model picked (the <server>__ prefix is stripped before the trace is recorded). |
| tool_calls[i].server | MCP server the call routed to. Always set, even on single-server runs. |
| tool_calls[i].args.<path> | Arbitrary JSON path into the arguments the model passed. |
| tool_calls[i].hop_index | Zero-based position of the call in the run. Assert a tool was called at a specific step, not just "at some point". |
| tool_calls[i].agent_id | Which agent made the call, in multi-agent runs. Absent (single-agent) keeps the legacy shape. |
| tool_calls[i].inputs_digest | Stable fingerprint of the context the agent had when it made the call. Assert it is unchanged across runs for a determinism check. |
| tool_calls.length | Total number of tool calls in the run. |
| tool_names | Ordered array of the tool names called, for asserting the whole trajectory in one matcher: exact: [search, fetch] pins the sequence, contains-all: [search] requires a tool, not: { contains: danger } forbids one. |
| redundant_tool_calls | Count of tool calls that repeat an identical (name, args) pair already made earlier in the run. A deterministic backtracking / efficiency signal: assert exact: 0 for a clean plan, or a ceiling. |
| tool_results[i].isError | Whether the server flagged the result as an error. |
| tool_results[i].server | Mirror of the matching tool_calls[i].server. |
| tool_results[i].result.<path> | Arbitrary JSON path into the server's response. |
| final_response | The model's last plain-text reply. Empty when the loop ran out of turns. |
| conversation.tokens.total, conversation.tokens.prompt, conversation.tokens.completion | Cumulative token usage reported by the provider. |
| conversation.duration_ms | Wall-clock duration of the agent loop. |
| conversation.message_count | Number of user / assistant / tool messages exchanged. |
| conversation.stop_reason | Why the run stopped: completed, turn_budget, token_budget, tool_call_budget, or time_budget. |
| conversation.per_turn[i].user, .final_response, .tool_calls[j].name, ... | Per-user-turn breakdown for a multi-turn run (see below). Empty on a single-prompt run. |
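
To show what the tool_names and redundant_tool_calls targets measure, the sketch below derives both from a list of recorded calls. It is a Python illustration, not the runner's trace code; the dictionary shape of each call and the use of sorted-key JSON to compare arguments are assumptions.

```python
import json
from typing import List


def tool_names(calls: List[dict]) -> List[str]:
    """Ordered list of the tool names called during the run."""
    return [call["name"] for call in calls]


def redundant_tool_calls(calls: List[dict]) -> int:
    """Count calls whose (name, args) pair already appeared earlier in the run."""
    seen = set()
    redundant = 0
    for call in calls:
        # Assumption: args are compared as canonical (sorted-key) JSON.
        key = (call["name"], json.dumps(call.get("args", {}), sort_keys=True))
        if key in seen:
            redundant += 1
        else:
            seen.add(key)
    return redundant
```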

Multi-turn conversations

By default an agent test drives one prompt through one tool-using loop. A turns: block drives a multi-turn conversation instead, carrying the conversation and tool state across turns. The top-level trace still aggregates every turn (so tool_calls, final_response, and the single-turn metrics resolve over the whole conversation), and each turn's slice is recorded under conversation.per_turn[i].

Scripted: an ordered list of user messages the driver replays.

agents:
  - name: multi-step trip planning
    model: claude-sonnet-4-5
    servers: [flights, hotels]
    turns:
      - user: "Find a flight SFO to JFK next Friday"
      - user: "Now book a hotel near JFK for that night"
    eval:
      multi_turn_mcp_use: { threshold: 0.6 }   # averages per-turn MCP use

Simulated: an LLM plays the user, generating each turn from a goal, an optional persona, and the conversation so far, until it answers DONE or hits max_turns (default 6). The simulator shares the agent's model, so with no API key it falls back to the deterministic stub and CI stays green.

    turns:
      simulate:
        goal: "Book a same-day round trip and add it to my calendar"
        persona: "terse traveler who changes their mind once"
        max_turns: 6

With a turns: block, prompt is optional. Each turn's slice exposes conversation.per_turn[i].user, .final_response, .tool_calls[j], and .tool_results[j] (with call_index local to the turn), so an expect: can assert on a specific turn, e.g. target: conversation.per_turn[1].tool_calls[0].name.

eval (judged metrics)

An agent test can carry an eval: block of judged metrics that score the run trace after it completes. Each metric produces its own PASS/FAIL report row (alongside the per-run expect: rows) and gates the exit code. Metrics are judged by the agent's own model, so a run with no provider key defers them (reported, not failed) and CI stays green.

agents:
  - name: books a meeting end to end
    model: claude-sonnet-4-5
    servers: [calendar]
    prompt: "Book a meeting with Alice next Tuesday at 2pm and confirm it."
    eval:
      mcp_task_completion: { threshold: 0.6 }   # did the run accomplish the task?
      mcp_use: { threshold: 0.6 }               # right tools, right arguments?
      argument_correctness: {}                   # tool-call arguments correct?
      plan_quality: {}                           # sensible call sequence?
      rubric:                                     # your own weighted criteria
        threshold: 0.8
        criteria:
          - name: booked the right day
            weight: 2
            description: "The agent created an event on the correct Tuesday."
          - name: confirmed to the user
            description: "The final reply confirms the booking."

Metrics:

A rubric can instead be a decision tree: a branching set of yes/no judgments that ends at a fixed score. Each ask node poses one narrow question; the judge answers it, and the run descends the yes or no branch (a yes is a judge score of 0.5 or higher). A score leaf ends the walk with that score (0..1) and an optional reason. One narrow question per node is easier to judge reliably and to audit than one holistic score, and only the questions reach the model. The report shows the exact path taken.

    eval:
      rubric:
        threshold: 0.7
        tree:
          ask: "Did the agent call the get_weather tool?"
          yes:
            ask: "Does the final reply state a temperature?"
            yes: { score: 1.0, reason: "called the tool and reported a temperature" }
            no:  { score: 0.4, reason: "called the tool but gave no temperature" }
          no:    { score: 0.0, reason: "never called the weather tool" }

A tree-mode failure note records the path, e.g. path: Did the agent call the get_weather tool? -> yes, Does the final reply state a temperature? -> no. Give a rubric either criteria or tree, not both.
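
The walk itself is short enough to sketch. The Python below is an illustration of the descent rule described above, not mcptest's evaluator; `walk_tree` is a hypothetical name, the tree is a plain dictionary with string keys, and `judge` stands in for whatever scores a question between 0 and 1.

```python
from typing import Callable, List, Tuple


def walk_tree(node: dict, judge: Callable[[str], float]) -> Tuple[float, List[str]]:
    """Descend ask nodes until a score leaf; return the score and the path taken."""
    path: List[str] = []
    while "score" not in node:
        # A judge score of 0.5 or higher counts as "yes".
        answer = "yes" if judge(node["ask"]) >= 0.5 else "no"
        path.append(f"{node['ask']} -> {answer}")
        node = node[answer]
    return node["score"], path
```

Returning the path alongside the score is what lets a report show exactly which questions led to the final number.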

Every metric scores 0..1 and gates on its threshold. The score and the judge's reasons appear in the report, with secrets redacted.

Auto-detected provider families (set the env var for each model you want to exercise; missing keys fall back to a deterministic stub so CI stays green):

| Family | Detected when model starts with | Env var |
| --- | --- | --- |
| Anthropic | claude- | ANTHROPIC_API_KEY |
| OpenAI | gpt-, chatgpt-, o<digit>, text-, davinci- | OPENAI_API_KEY (+ optional OPENAI_ORG_ID) |
| Google | gemini-, models/gemini- | GEMINI_API_KEY (or GOOGLE_API_KEY) |
| Mistral | mistral-, codestral-, magistral-, ministral-, devstral-, pixtral-, open-mistral-, open-mixtral- | MISTRAL_API_KEY |

For anything else (Azure OpenAI, OpenRouter, vLLM, LiteLLM, Groq, Together, etc.), see providers block.
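
Prefix-based detection of this kind can be sketched as a lookup over the prefixes listed above. The Python below is illustrative only; `detect_family`, the lower-case family labels, and returning None for unmatched names are assumptions rather than mcptest's API.

```python
import re
from typing import Optional

_PREFIXES = {
    "anthropic": ("claude-",),
    "openai": ("gpt-", "chatgpt-", "text-", "davinci-"),
    "google": ("gemini-", "models/gemini-"),
    "mistral": (
        "mistral-", "codestral-", "magistral-", "ministral-",
        "devstral-", "pixtral-", "open-mistral-", "open-mixtral-",
    ),
}


def detect_family(model: str) -> Optional[str]:
    """Guess the provider family from a bare model string (sketch)."""
    if re.match(r"o\d", model):  # the "o<digit>" rule, e.g. o3-mini
        return "openai"
    for family, prefixes in _PREFIXES.items():
        if model.startswith(prefixes):
            return family
    return None  # no match: needs a providers: entry
```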

faults block

Type: array, optional. Declares named fault injections an agent test opts into with inject:. A fault makes a tool behave like an unresponsive backend so the agent's recovery path is exercised. The executor synthesizes the fault in virtual time (no real hang), so a recovery test is deterministic and CI-bounded.

| Field | Required | Purpose |
| --- | --- | --- |
| name | yes | Unique fault name an agent test references via inject:. |
| target | yes | Which calls the fault applies to: { tool: <name> }, { server: <key> }, or both. At least one is required. A missing tool matches any tool; a missing server matches any server. |
| kind | yes | One of hang, wedged, slow, recover_after. |
| delay_ms | for slow | Milliseconds before the call answers. A delay at or beyond the run's max_detection_ms reads as unresponsive. |
| failures | for recover_after | Number of calls that hang before the server answers normally. |

The agent test that injects a fault gates the run with a recovery: block:

| Field | Required | Purpose |
| --- | --- | --- |
| max_detection_ms | yes | Per-call timeout budget. A correctly configured agent gives up on a hung call within this rather than blocking forever. |
| max_recovery_ms | no | Cap on total recovery time. Omit to leave recovery time unbounded (the gate then only requires that the agent recovered at all). |
| require_clean_timeout | no | When true, also require a clean timeout within the detection budget. Defaults to false. |

faults:
  - name: hung-search
    target: { tool: search }
    kind: hang
agents:
  - name: recovers from a hung search tool
    model: claude-sonnet-4-5
    servers: [faulty]
    inject: [hung-search]
    prompt: Search for the latest incident report.
    recovery:
      max_detection_ms: 3000
      max_recovery_ms: 5000
      require_clean_timeout: true

A poor-recovery agent (one that loops on the hung tool until its turn budget trips, never replying) fails the run as a quality failure (exit 1). A dangling inject: name, or a connect failure, is an infra error (exit 2). See Fault injection and recovery for the scoring model.
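
The target matching rule from the faults table (a missing tool matches any tool, a missing server matches any server) reduces to a short predicate. This Python sketch is illustrative; `fault_applies` is a hypothetical helper, not part of the executor.

```python
def fault_applies(target: dict, tool: str, server: str) -> bool:
    """Return True when a fault's target selects the given call (sketch)."""
    if "tool" not in target and "server" not in target:
        raise ValueError("target needs at least one of tool or server")
    # A missing key falls back to the call's own value, so it always matches.
    return target.get("tool", tool) == tool and target.get("server", server) == server
```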

providers block

Type: object, optional. Declares custom OpenAI-compatible endpoints that agent models: entries can target by name. The wire shape is the OpenAI Chat Completions API, so any gateway or self-hosted server speaking that shape is reachable through one of these.

providers:
  openrouter:
    type: openai
    base_url: https://openrouter.ai/api/v1
    api_key_env: OPENROUTER_API_KEY
  azure-prod:
    type: openai
    base_url: https://my-resource.openai.azure.com/openai/deployments/my-gpt-5
    api_key_env: AZURE_OPENAI_KEY
    organization: my-azure-org           # optional
  local-vllm:
    type: openai
    base_url: http://localhost:8000/v1
    # api_key_env omitted: unauthenticated local endpoint. The runner
    # sends `Authorization: Bearer EMPTY`, which the common
    # self-hosted servers accept.

agents:
  - name: my test
    models:
      - { provider: openrouter, id: anthropic/claude-3.5-sonnet }
      - { provider: azure-prod, id: my-gpt-5 }
    ...

Per-entry fields:

| Field | Required | Purpose |
| --- | --- | --- |
| type | yes | Wire protocol. Today the only supported value is openai. |
| base_url | yes | Endpoint base URL the runner POSTs to. |
| api_key_env | no | Environment variable that holds the bearer token. Omit for unauthenticated local endpoints. |
| organization | no | Optional OpenAI-Organization header. |

Worked example: agent-custom-providers.yml.

budget block

Type: object, optional. Per-test and per-suite spend caps applied to agent runs. Both fields are expressed in USD cents; omitting a field disables the cap for that scope.

budget:
  per_test_usd_cents: 50
  per_suite_usd_cents: 500

| Field | Required | Purpose |
| --- | --- | --- |
| per_test_usd_cents | no | Cap, in USD cents, on what one agent test can spend. The counter resets between tests. |
| per_suite_usd_cents | no | Cap, in USD cents, on what an entire suite can spend. Useful when a matrix run multiplies the per-test spend across N models. |

When a cap trips, the agent loop stops with a clear error before the provider issues a surprise bill.

expect block (the matchers)

Type: array of assertions, optional, default [].

Each assertion combines a target (where to look in the response) with a matcher (how to judge it) and an optional message (human-friendly description used in reporter output).

expect:
  - target: "result.content[0].text"
    matcher:
      contains: "hello"
    message: "tool should greet by name"

Fields per assertion:

expect:
  - target: "result.name"
    transform: jq ascii_downcase
    matcher:
      exact: "echo"

A matcher object carries exactly one key. The valid keys in v1 are exact, contains, regex, schema, snapshot, llm-judge, llm-jury, contains-all, contains-any, icontains, starts-with, is-json, is-valid-tools-call, levenshtein, is-xml, is-sql, similar, cel, factuality, answer-relevance, context-faithfulness, and not. The matcher selection is mutually exclusive at the schema level (minProperties: 1, maxProperties: 1), so the loader rejects a matcher that tries to combine exact: true with regex: "foo".
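
To make the one-key rule concrete, the fragment below contrasts a matcher the loader rejects with the equivalent written as two assertions. It is a sketch using only matcher keys from the list above; the target and the patterns are placeholders.

```yaml
# Rejected at load time: the matcher object carries two keys.
expect:
  - target: "result.content[0].text"
    matcher:
      exact: true
      regex: "foo"
---
# Accepted: one key per matcher, one assertion per check.
expect:
  - target: "result.content[0].text"
    matcher:
      contains: "ping"
  - target: "result.content[0].text"
    matcher:
      regex: "^ping from acct_"
```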

Every failed assertion produces a structured failure with the following common fields:

{
  "test_name": "echoes the input back",
  "target": "result.content[0].text",
  "matcher": "regex",
  "message": "echo should round-trip the rendered message",
  "expected": "^ping from acct_",
  "actual": "ping from acct_demo"
}

Reporters render this structure into pretty text, JSON, or JUnit XML. The matcher-specific shape is documented per matcher below.

exact matcher

Strict structural equality. Compares the extracted value to the matcher argument using JSON equality rules (objects, arrays, primitives).

expect:
  - target: "result.content[0].text"
    matcher:
      exact: "hello, world"

Failure shape:

{
  "matcher": "exact",
  "expected": "hello, world",
  "actual": "hello, world!",
  "diff": "+!"
}

Use exact when you want a deep-equal match. Reach for contains or regex when the response embeds extra text you do not care about.

See also: contains, regex.

contains matcher

Containment, matched by the type of the value:

Non-string scalars (numbers, booleans, null) compare by equality.

Object subset example (passes even though the actual result has more keys):

expect:
  - target: "result"
    matcher:
      contains:
        isError: false

Array multiset example (passes when every listed element is present):

expect:
  - target: "result.tags"
    matcher:
      contains: ["urgent", "billing"]

Failure shape (object subset, missing key):

{
  "matcher": "contains",
  "path": "/isError",
  "expected": false,
  "note": "missing key `isError` in actual value"
}

See also: exact, icontains, regex.

regex matcher

Tests whether the regex pattern matches anywhere in the stringified value. Uses the regex crate's default syntax.

expect:
  - target: "result.content[0].text"
    matcher:
      regex: "^ping from acct_"

Failure shape:

{
  "matcher": "regex",
  "expected": "^ping from acct_",
  "actual": "pong from svc_demo",
  "match": null
}

When the regex compiles but does not match, match is null. When the regex itself is invalid, the loader fails the entire suite at validation time with a pointer to the offending field.
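
Because the pattern is searched for anywhere in the stringified value, anchors are opt-in and non-string targets can be matched too. The sketch below assumes a numeric value stringifies to its plain decimal form (for example -32000 becomes "-32000"); that detail is an assumption, not something this page states.

```yaml
expect:
  # Unanchored: passes if the pattern occurs anywhere in the text.
  - target: "result.content[0].text"
    matcher:
      regex: "acct_[a-z0-9]+"
  # Non-string value: the pattern runs against its stringified form.
  - target: "result.error.code"
    matcher:
      regex: "^-32\\d{3}$"
```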

See also: exact, contains.

schema matcher

Validates the value against an inline JSON Schema. Useful for asserting the shape of a structured tool response without pinning specific values.

expect:
  - target: "result"
    matcher:
      schema:
        type: object
        required: ["content", "isError"]
        properties:
          content:
            type: array
            minItems: 1
          isError:
            type: boolean

Failure shape:

{
  "matcher": "schema",
  "errors": [
    {
      "instance_path": "/isError",
      "schema_path": "/properties/isError/type",
      "message": "expected boolean, got string"
    }
  ]
}

The schema runs through the same jsonschema crate the loader uses, so draft 2020-12 features such as if/then/else, prefixItems, unevaluatedProperties, oneOf/anyOf/allOf, and $defs with internal $ref work as documented. To assert that a value is an array with at least one element (the role the dropped length matcher used to play), use type: array with minItems.

Security limits. Three guards apply; the first two run before validation, the third during it:

  1. External $ref URIs are refused at compile time. Any $ref value that does not start with # (a same-document fragment) fails the matcher with SchemaExternalRef. The validator never reaches the network.
  2. The schema's JSON nesting depth is capped (default 64). A schema deeper than the cap fails with SchemaTooDeep before compile.
  3. Validation iteration runs under a wall-clock budget (default 2 seconds). Exceeding the budget fails the matcher with SchemaValidationTimedOut.

Real-world MCP schemas sit far under these defaults, so a hand-authored suite never trips them. The limits exist to keep a pathological schema or a malicious server response from stalling the runner.
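
As an illustration of the first guard, the sketch below factors a repeated shape into $defs and points at it with a same-document $ref, which is the only kind of reference the matcher accepts. The textBlock definition is a hypothetical shape chosen for the example.

```yaml
expect:
  - target: "result"
    matcher:
      schema:
        type: object
        required: ["content"]
        properties:
          content:
            type: array
            minItems: 1
            items:
              $ref: "#/$defs/textBlock"   # starts with '#': same-document, allowed
        $defs:
          textBlock:                      # hypothetical definition for this sketch
            type: object
            required: ["type", "text"]
            properties:
              type: { const: "text" }
              text: { type: string }
        # An external reference such as
        #   $ref: "https://example.com/block.json"
        # would fail the matcher with SchemaExternalRef instead of being fetched.
```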

See also: exact, contains, the oneOf / anyOf / allOf composition matchers below.

snapshot matcher

Records the extracted value to disk on the first run, then deep-compares the extracted value against the recorded copy on every subsequent run. A mismatch fails the test with a readable diff. The shorthand is a string (the snapshot key); the long form is an object with optional flags.

expect:
  - target: "result.content"
    matcher:
      snapshot: "lists-tools-content"

The key resolves to a file under the snapshots/ directory next to the suite YAML: snapshot: "lists-tools-content" writes and reads <suite_dir>/snapshots/lists-tools-content.json. Keys may be nested (snapshot: "tools/echo/baseline" becomes <suite_dir>/snapshots/tools/echo/baseline.json); intermediate directories are created on first write. A .json extension is appended automatically unless the key already carries an extension.
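
A short sketch of the nested-key form described above; the key itself is the one used in the paragraph, and the comment restates where it resolves.

```yaml
expect:
  - target: "result.content"
    matcher:
      # Reads and writes <suite_dir>/snapshots/tools/echo/baseline.json;
      # the tools/echo/ directories are created on first write.
      snapshot: "tools/echo/baseline"
```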

On the first run (or when the snapshot file is missing) the matcher records the current value and passes. On later runs it loads the recorded value and compares; the test fails when they differ. To re-record after an intentional change, run with --update-snapshots (or -u), which rewrites every snapshot the run touches and passes. Review the resulting git diff before committing.

mcptest run --update-snapshots

--update-snapshots is refused when CI=true is set, so a CI job never silently rewrites a golden file. Pass --allow-update-in-ci to override.

Long form (per-test update override):

expect:
  - target: "result.content"
    matcher:
      snapshot:
        name: "lists-tools-content"
        update: false

Failure shape:

{
  "matcher": "snapshot",
  "note": "snapshot at snapshots/lists-tools-content.json did not match (run with --update-snapshots to refresh)",
  "expected": "...recorded value...",
  "actual": "...current value..."
}

See also: exact, schema, cassette and snapshot layout.

llm-judge matcher

Routes the candidate string through an LLM with a grading rubric. The matcher passes when the judge returns pass: true or a score at or above threshold. The judge model is invoked through the same provider lookup the agent driver uses (env-var auto-detection plus named providers), so any provider family configured for the suite is fair game. The literature grounding is in docs/research-references.md.

expect:
  - target: final_response
    matcher:
      llm-judge:
        rubric: |
          The response must mention Sacramento and at least one
          temperature number, and must not invent details the tool
          did not return.
        threshold: 0.8
        model: claude-sonnet-4-5    # optional override

Fields under llm-judge:

Failure note: the matcher's diff carries the score and the judge's one-sentence reason, so the reporter shows you why the judge said no.

llm-jury matcher

Like llm-judge, but runs N independent jurors and requires a quorum to pass. Useful when one judge model is itself the subject of the test (so you do not want it grading its own output) or when the assertion is high-stakes enough to want consensus. Reports inter-juror agreement (Krippendorff's alpha) alongside the verdict so a split jury is obvious.

expect:
  - target: final_response
    matcher:
      llm-jury:
        rubric: |
          Reply pass when the response (1) names Sacramento, (2) cites
          the temperature the tool returned, (3) is no longer than two
          sentences.
        jurors:
          - model: claude-sonnet-4-5
          - model: gpt-5
          - model: gemini-2.5-pro
        threshold: 0.7
        quorum: 0.66

Fields under llm-jury:

Jurors that error out (network failure, malformed reply) are recorded as abstaining; they do not count for or against quorum. The reporter shows each juror's verdict and the aggregate pass_fraction.

Worked example: examples/agent-llm-judge.yml.

See also: contains, regex, agents block.

contains-all matcher

The string value must contain every listed substring, or the array value must contain every listed element. The argument is an array of needles.

expect:
  - target: "result.content[0].text"
    matcher:
      contains-all: ["order", "shipped", "tracking"]
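
The same matcher applies to an array-valued target, where each needle must appear as an element. This sketch borrows the result.tags target from the contains section; the tag values are placeholders.

```yaml
expect:
  - target: "result.tags"
    matcher:
      # Passes only when both elements are present in the array.
      contains-all: ["urgent", "billing"]
```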

contains-any matcher

The string value must contain at least one listed substring, or the array value must contain at least one listed element. An empty list never passes.

expect:
  - target: "result.content[0].text"
    matcher:
      contains-any: ["approved", "accepted"]

icontains matcher

Case-insensitive substring containment. The argument is a single string.

expect:
  - target: "result.content[0].text"
    matcher:
      icontains: "SUCCESS"

starts-with matcher

The string value must start with the given prefix.

expect:
  - target: "result.content[0].text"
    matcher:
      starts-with: "data:image/png;base64,"

is-json matcher

The string value must parse as JSON. Pass ~ (null) for a parse-only check, or an object with a schema key to also validate the parsed document against an inline JSON Schema (the same engine the schema matcher uses).

expect:
  - target: "result.content[0].text"
    matcher:
      is-json:
        schema:
          type: object
          required: ["id", "status"]
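
The parse-only form drops the schema key and passes null, as described above. A minimal sketch:

```yaml
expect:
  - target: "result.content[0].text"
    matcher:
      # ~ is YAML null: the text only has to parse as JSON.
      is-json: ~
```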

is-valid-tools-call matcher

The value must be a well-formed MCP tool call: an object with a string name and an arguments object. Pass ~ (null) for a shape-only check, or an object with a schema key to validate the call's arguments against an inline JSON Schema.

expect:
  - target: "tool_calls[0]"
    matcher:
      is-valid-tools-call:
        schema:
          type: object
          required: ["city"]

levenshtein matcher

The Levenshtein edit distance between the stringified value and the reference value must be at most max. Useful for "close enough" text comparisons without an embedding model. The distance counts insertions, deletions, and substitutions over Unicode scalar values.

expect:
  - target: "result.content[0].text"
    matcher:
      levenshtein:
        value: "the quick brown fox"
        max: 3
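
A small worked case shows how max gates the result. The reference and candidate strings here are the textbook kitten/sitting pair, chosen only because the distance is easy to check by hand.

```yaml
expect:
  - target: "result.content[0].text"
    matcher:
      levenshtein:
        value: "kitten"
        max: 3
# actual "sitting":  distance 3 (k to s, e to i, append g)  -> passes
# actual "sittings": distance 4 (one more insertion)        -> fails
```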

is-xml matcher

The string value must parse as well-formed XML. Pass ~ (null) for a well-formedness check, or an object with a root key to also assert the document's root element name. The value is read to EOF with a streaming parser, so any malformed markup (an unclosed tag, mismatched element) fails the assertion.

expect:
  - target: "result.content[0].text"
    matcher:
      is-xml:
        root: "note"

is-sql matcher

The string value must parse as SQL. The value is handed to a real SQL parser using the permissive generic dialect, so it accepts the broadest set of statements rather than pinning one engine's grammar. Useful for a "text to SQL" tool whose output should at least be syntactically valid. The matcher takes no options; pass ~ (null).

expect:
  - target: "result.content[0].text"
    matcher:
      is-sql: ~

similar matcher

Embedding cosine similarity to a reference string. The matcher embeds both the reference value and the actual value with the named embedding model, then passes when the cosine similarity is at least threshold (a floor in 0..1). Use it when two answers can be worded differently but mean the same thing, where levenshtein (character distance) is too literal.

Like llm-judge, this matcher calls an external model, so it needs a configured provider with an embeddings endpoint (OpenAI text-embedding-3-*, Google, Mistral, or a local Ollama embedding model). Anthropic has no embeddings API, so pointing similar at it fails with a clear unsupported-feature error. Because it is async, similar is not composable under the not: wrapper.

expect:
  - target: "result.content[0].text"
    matcher:
      similar:
        value: "the order shipped and is on its way"
        threshold: 0.85
        model: "text-embedding-3-small"

cel matcher

A CEL (Common Expression Language) boolean predicate over the value. Reach for it when the built-in matchers do not express the check you need and you want a deterministic rule rather than an LLM judge. The value at target is bound into the expression as the variable value, so the predicate reads the resolved target directly. Omit target (or set it to "") to bind the whole response envelope as value.

CEL is the same predicate language SBproxy uses, so one expression dialect carries across the gateway and the tester. It evaluates in-process with no model and no network call, so unlike llm-judge it is deterministic and composes under not:. The regex helper is available, so value.matches("^ok") works. The expression must return a boolean; a parse error, an evaluation error, or a non-boolean result fails the run as a suite bug rather than a failed assertion.

Keep the split with transform: in mind: jq under transform: reshapes data (one value in, one value out), while cel: is a boolean predicate. Do not reach for one where the other fits.

expect:
  - target: "result"
    matcher:
      cel: "value.content.size() > 0 && !value.isError"
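
The variations described above can be sketched as follows. The second assertion assumes the response envelope exposes result at its top level, which matches how targets on this page are rooted but is not spelled out here.

```yaml
expect:
  # value is the string at the target; matches() is the regex helper.
  - target: "result.content[0].text"
    matcher:
      cel: 'value.matches("^ok")'
  # No target: value is bound to the whole response envelope.
  - matcher:
      cel: "!value.result.isError"
  # cel is deterministic, so it composes under not:.
  - target: "result.content"
    matcher:
      not:
        cel: "value.size() == 0"
```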

factuality, answer-relevance, context-faithfulness matchers

Named model-graded matchers with a shared vocabulary for RAG and QA suites. Each is sugar over llm-judge: it carries one field and desugars to a judge call with a vetted, fixed rubric that embeds that field, so the judge plumbing, threshold gating, and calibration all carry over. Like llm-judge, each needs a configured provider, and each takes an optional threshold (default 0.7 on the judge's 0..1 score) and a model override.

expect:
  - target: "result.content[0].text"
    matcher:
      factuality:
        reference: "Paris is the capital of France"
  - target: "result.content[0].text"
    matcher:
      answer-relevance:
        query: "What is the capital of France?"
        threshold: 0.8
  - target: "result.content[0].text"
    matcher:
      context-faithfulness:
        context: "France is a country in Europe. Its capital is Paris."

not matcher

Universal negation. Wraps another matcher and passes when that matcher fails (and fails when it passes). One not: wrapper composes over every deterministic matcher, so there is no per-type not-contains or not-regex variant to learn. Negation is limited to deterministic matchers; wrapping a stateful matcher (snapshot) or an async/LLM matcher (llm-judge, llm-jury, similar) is an evaluation error.

expect:
  - target: "result.content[0].text"
    matcher:
      not:
        contains: "error"

oneOf / anyOf / allOf composition matchers

JSON Schema 2020-12 ships three composition keywords. mcptest exposes the same composition at the matcher level so a suite can compose any matchers, not just schemas, across one extracted target.

oneOf: passes when exactly one inner matcher passes.

expect:
  - target: "result.content[0].text"
    matcher:
      oneOf:
        - exact: "ok"
        - exact: "ready"

anyOf: passes when at least one inner matcher passes.

expect:
  - target: "result.content[0].text"
    matcher:
      anyOf:
        - contains: "ok"
        - contains: "ready"
        - regex: "^accepted-\\d+$"

allOf: passes when every inner matcher passes. Useful for combining a positive predicate with a negative one without two separate assertions.

expect:
  - target: "result"
    matcher:
      allOf:
        - contains: { isError: false }
        - not:
            contains: { content: [{ text: "" }] }

The body must be a non-empty YAML sequence of matchers. An empty sequence is a load-time error.

Composition wraps only deterministic matchers, the same scope as not:. An inner snapshot, llm-judge, llm-jury, or similar is an evaluation error rather than a silent skip, because a composition that silently dropped a branch would change its truth value without the suite author knowing. Structural errors from inner matchers (a bad regex, a malformed schema) surface unchanged rather than being swallowed as a failing branch.

Compositions nest. allOf: [ anyOf: [ ... ], not: { ... } ] is legal and behaves the obvious way.
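
Written out in block style, a nested composition might look like the sketch below. The matchers and strings are placeholders; only deterministic matchers appear, in line with the scope rule above.

```yaml
expect:
  - target: "result.content[0].text"
    matcher:
      allOf:
        # Branch 1: the text starts with one of two accepted prefixes.
        - anyOf:
            - starts-with: "ok"
            - starts-with: "ready"
        # Branch 2: and it never mentions an error.
        - not:
            contains: "error"
```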

Budgets and headers (not matchers)

Four expect-level fields look like matchers in casual conversation but live alongside matcher: inside the expect: block, not inside the matcher: object itself. They apply to the whole step, not to one extracted target. To use them, write the expect: block in its long form: an object with an assertions: array plus any of the fields below.

expect:
  assertions:
    - target: "result.content[0].text"
      matcher:
        contains: "hello"
  max_duration_ms: 500
  response_headers:
    content-type: "application/json"
  response_headers_absent: ["set-cookie"]

Naming history

An earlier draft of the schema used equals, length, and jsonpath. Those names were dropped before the v1.0 release; the canonical matcher keys are listed above. Configs that still use the old names fail schema validation with a clear "additional property not allowed" error.

Test styles

mcptest supports three test styles, written to feel familiar whether you think in matchers, in step-by-step protocol traces, or in raw JSON-RPC.

Flat (default)

The flat style is one test per array entry, one tool call per test. This is what every example above uses. It is the style 95 percent of suites should adopt.

tools:
  - name: "lists tools without error"
    server: filesystem
    tool: "list_directory"
    args:
      path: "/tmp"
    expect:
      - target: "result.content"
        matcher:
          schema:
            type: array
            minItems: 1

  - name: "reads a known fixture"
    server: filesystem
    tool: "read_file"
    args:
      path: "${fixture_path}"
    expect:
      - target: "result.content[0].text"
        matcher:
          contains: "hello"

Stepwise

Stepwise tests run multiple tool calls in order, sharing state between steps. The current schema captures stepwise behavior via separate tools: entries that share ${variables}. A dedicated steps: field is on the roadmap (tracked under a follow-up ticket).

For now, model a multi-step interaction by chaining flat tests against the same server and using a variables: entry to thread shared identifiers:

variables:
  resource_id:
    value: "res_abc123"

tools:
  - name: "step 1: creates a resource"
    server: remote_api
    tool: "create_resource"
    args:
      id: "${resource_id}"
    expect:
      - target: "result.isError"
        matcher:
          exact: false

  - name: "step 2: reads the resource back"
    server: remote_api
    tool: "read_resource"
    args:
      id: "${resource_id}"
    expect:
      - target: "result.content[0].text"
        matcher:
          contains: "${resource_id}"

  - name: "step 3: deletes the resource"
    server: remote_api
    tool: "delete_resource"
    args:
      id: "${resource_id}"
    expect:
      - target: "result.isError"
        matcher:
          exact: false

Raw JSON-RPC

Raw JSON-RPC tests sidestep the tool-call abstraction and assert against the literal protocol envelope. The v1 schema models this through compliance: checks. Use the check: field to name a built-in protocol sequence (for example, initialize or tools/list), then assert on the raw response.

compliance:
  - name: "negotiates capabilities on initialize"
    server: filesystem
    check: "initialize"
    expect:
      - target: "result.protocolVersion"
        matcher:
          regex: "^2\\d{3}-\\d{2}-\\d{2}$"
      - target: "result.capabilities"
        matcher:
          schema:
            type: object
            required: ["tools"]

  - name: "advertises required tools"
    server: filesystem
    check: "tools/list"
    expect:
      - target: "result.tools"
        matcher:
          schema:
            type: array
            minItems: 1

A wider raw-frame style (full request envelope under your control, response captured as a JSON value) is on the roadmap. The current built-in checks are documented in compliance block.

See also: tools block, compliance block.

Variable interpolation

mcptest interpolates ${name} references in any string field at load time (after schema validation, before the first test runs). The available forms:

The full precedence table lives at mcptest.sh/docs/secrets-and-variables.

A worked example combining all forms:

variables:
  base_url:
    value: "https://api.example.com"

tools:
  - name: "uses every interpolation form"
    server: remote_api
    tool: "echo"
    args:
      url: "${base_url}/health"
      token: "${MCPTEST_API_TOKEN:?}"
      label: "${ENV_LABEL:-staging}"
      short: "$ACCOUNT_ID"
      literal: "the price is $$5"
    expect:
      - target: "result.isError"
        matcher:
          exact: false

bearer_token_env vs ${NAME}

These look similar and are not the same.

servers:
  api_a:
    url: "https://a.example.com"
    auth:
      bearer_token_env: "API_A_TOKEN"

  api_b:
    url: "https://b.example.com"
    auth:
      bearer_token_env: "${SECRET_VAR_NAME}"

For api_a, mcptest reads the env var named API_A_TOKEN directly and sends its value as a bearer header. The YAML never holds the secret.

For api_b, mcptest first interpolates ${SECRET_VAR_NAME} to find the name of the env var to read, then reads that env var. This is occasionally useful (rotating tokens with versioned names), but most suites should stick to the literal bearer_token_env: NAME form. Reporters know that the value read by bearer_token_env is a credential and redact it from logs. Plain ${NAME} interpolation gets no such treatment.

See also: variables block, auth block.

imports block

Type: array of strings, optional, default [].

Each entry is a relative or absolute path to another mcptest YAML file. The loader merges imports in order, with later imports overriding earlier ones, and the current file overriding all of its imports. The intended shape:

imports:
  - "./shared/servers.yml"
  - "./shared/variables.yml"

tools:
  - name: "uses an imported server"
    server: shared_filesystem
    tool: "list_directory"
    args:
      path: "/tmp"

The full import implementation lands in a follow-up ticket. Today the loader recognizes the imports: array (the schema validates it) and prints a clear error if any path fails to resolve. Treat this section as a forward contract: write your suites against the documented shape, and the loader will start honoring the imports when the implementation lands.

Cycle detection is part of the same follow-up. Until then, do not import a file that imports your file. The loader will reject cycles with a structured error once the feature ships.

See also: variables block, servers block.

compliance block

Type: array of objects, optional, default [].

Compliance tests assert that the server speaks the MCP protocol correctly: capability negotiation, error shapes, required method presence. The v1 surface is a small set of built-in checks. The full compliance corpus is on the roadmap and not yet wired up.

compliance:
  - name: "negotiates capabilities on initialize"
    server: filesystem
    check: "initialize"
    expect:
      - target: "result.protocolVersion"
        matcher:
          regex: "^2\\d{3}-\\d{2}-\\d{2}$"

Fields per entry:

The intended shape for the full corpus is a curated set of named checks shipped with the binary. Each check encodes a known-good interaction (for example, initialize issues an initialize request, validates the response, captures the server capabilities) and exposes assertion targets into the captured frames. The shape of those targets is the same as for tool tests, so the matchers in this doc apply unchanged.

See also: expect block, servers block.

evals block

Type: array of objects, optional, default [].

Evaluations grade a response against a rubric and return a score. An entry names a server and a prompt, supplies a rubric, and gates on a threshold. The rubric is the same shape the agent-side eval.rubric matcher uses, so a free-form string, a weighted criteria list, or a decision tree all work here.

A rubric takes one of three forms:

evals:
  # 1. Free-form string: a single holistic judgment.
  - name: "summary stays on topic"
    server: remote_api
    prompt: "Summarize the latest deployment."
    rubric: "Answer must mention the service name and the release tag."
    threshold: 0.7

  # 2. Weighted criteria: the score is the weight-normalized average of the
  #    per-criterion judgments, each reported with its own reason.
  - name: "booking quality"
    server: calendar
    prompt: "Book a meeting with Alice next Tuesday at 2pm and confirm it."
    rubric:
      threshold: 0.8
      criteria:
        - name: booked the right day
          description: "Created an event on the correct Tuesday."
          weight: 2
        - name: confirmed to the user
          description: "The final reply confirms the booking."

  # 3. Decision tree: one yes/no question per node, ending at a fixed score.
  - name: "weather answered"
    server: weather
    prompt: "What is the weather in Paris?"
    rubric:
      threshold: 0.7
      tree:
        ask: "Does the answer state a temperature?"
        yes: { score: 1.0, reason: "reported a temperature" }
        no:  { score: 0.0, reason: "no temperature" }

Fields per entry:

Evaluations run via mcptest eval (the dedicated subcommand). The default mcptest run invocation skips them so a basic CI gate stays cheap. The judge model is resolved from the environment; without an API key (or without a response to grade) each eval defers, reported as passed with a note, so a key-free CI run stays green. See Rubric scoring for the full guide.

scorers block

Type: array of objects, optional, default []. Top-level key (a sibling of servers: and tools:, not nested under evals:).

A scorer is an external grader that returns the same {verdict, score, reasoning, cost_usd} envelope as the built-in judge, so reporters, --max-cost, and verdict caching stay uniform no matter who scored. The build ships the exec scorer; vendor types (braintrust, langsmith, ...) are not included.

scorers:
  - name: factuality
    type: exec
    command: ["python3", "examples/scorers/braintrust.py"]
    env:
      BRAINTRUST_API_KEY: ${BRAINTRUST_API_KEY}
    timeout_ms: 30000

Fields per entry:

A cost_usd: null response is accepted (external scorers often cannot compute cost) and does not count toward --max-cost. See the worked walkthrough and reference wrappers in external scorers.

See also: Research references, External scorers.

model_compatibility block

Type: array of strings, optional, default [].

Pure metadata. Each entry is a model identifier the test suite targets. Reporters use this list to label results when a run is associated with a specific model.

model_compatibility:
  - "claude-opus-4-7"
  - "claude-sonnet-4-5"
  - "gpt-4o"

The full model-compatibility matrix (cross-model pass-fail grids, diff-against-baseline output) is the W8 milestone. v1 only records the list and surfaces it to reporters.

performance block

Type: object, optional, no default fields.

Time and duration units

mcptest uses two time conventions, and the field name tells you which:

When in doubt, prefer the duration-string form where the schema allows it, since the unit is explicit. A bare integer in a duration-string field is rejected at load time.

performance holds two soft budgets that apply suite-wide.

performance:
  default_timeout_ms: 10000
  p95_latency_ms: 2000

Fields:

Deeper performance work (load generation, sustained throughput, percentile-over-time output) lives behind a separate mcptest bench command and is out of scope for v1.

See also: tools block.

Profiles and baseline

Two practical knobs help large suites stay green over time. Both surfaces are stable on the CLI side; the YAML-side handles are in flight.

profile: keyword (in flight)

A profile: keyword on individual tests selects which test groups run for a given invocation. The three named profiles in flight are:

The intended shape at the test-entry level:

tools:
  - name: "fast smoke"
    server: local
    tool: "ping"
    profile: "quick"
    expect:
      - target: "result.isError"
        matcher:
          exact: false

  - name: "expensive scenario"
    server: local
    tool: "stress"
    profile: "full"
    args:
      payload_kb: 1024

The CLI side already accepts --profile. The YAML schema entry for profile: is deferred (tracked under a follow-up ticket). Until it lands, filter via --include / --exclude test name patterns on the CLI.

Baselining known failures

There is no run-level expected-failures baseline yet: mcptest run does not take a --baseline flag. The baseline that ships today is the compliance baseline, a list of compliance rule IDs a server is known to fail, consumed by mcptest compliance run --baseline. A new rule failure flips the gate red; a baselined rule that starts passing is flagged so you can trim the file. See compliance-baseline.md for the file shape and the CI workflow.

Named error scenarios

A fixtures block at the top of the file declares reusable error shapes once, then any tool test can trigger one by name via inject_error:. The runner short-circuits the live dispatch path for these tests: when a tool entry sets inject_error: <name>, the executor synthesizes the {"result": {"jsonrpc":"2.0","error":{...}}} envelope from the named fixture and feeds it through the assertion pipeline as if the server had returned it. No request crosses the wire, no transport is spawned, no cassette is written. This lets a suite exercise the same failure paths (rate limits, expired tokens, 5xx blips) across many tool calls without standing up a stub server.

fixtures:
  server: mock_github
  errors:
    - name: rate_limited
      code: -32000
      message: "GitHub API rate limit exceeded"
      tool: create_issue
    - name: auth_expired
      code: -32001
      message: "OAuth token expired"
      applies_to: any

tools:
  - name: "handles rate limiting gracefully"
    server: mock_github
    tool: create_issue
    inject_error: rate_limited
    args:
      title: "Triage queue overflow"
    expect:
      - target: "result.isError"
        matcher:
          exact: true

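Fields under fixtures:

| Field | Type | Required |
| --- | --- | --- |
| fixtures.server | string | no |
| fixtures.errors[].name | string | yes |
| fixtures.errors[].code | integer | yes |
| fixtures.errors[].message | string | yes |
| fixtures.errors[].tool | string | yes (one of) |
| fixtures.errors[].applies_to | "any" | yes (one of) |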

Exactly one of tool or applies_to must be present. The schema rejects entries that set both or neither.
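
The one-of rule is easiest to see side by side. The sketch below reuses the two fixtures from the example above and adds two commented-out entries, conflicting_scope and missing_scope. Both names, their codes, and their messages are hypothetical and exist only to illustrate the shapes the schema is described as rejecting.

```yaml
fixtures:
  server: mock_github
  errors:
    # Valid: sets tool and omits applies_to.
    - name: rate_limited
      code: -32000
      message: "GitHub API rate limit exceeded"
      tool: create_issue

    # Valid: sets applies_to and omits tool.
    - name: auth_expired
      code: -32001
      message: "OAuth token expired"
      applies_to: any

    # Invalid (hypothetical entry): sets both tool and applies_to.
    # - name: conflicting_scope
    #   code: -32002
    #   message: "placeholder message"
    #   tool: create_issue
    #   applies_to: any

    # Invalid (hypothetical entry): sets neither tool nor applies_to.
    # - name: missing_scope
    #   code: -32003
    #   message: "placeholder message"
```

Uncommenting either invalid entry should make mcptest validate fail, assuming the schema enforces the rule as stated.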

On the test side, inject_error: on a tool entry names a fixture. The field is an optional string whose value is the name of an entry under fixtures.errors.

The synthesized envelope matches the shape a real MCP server returns for a JSON-RPC error: a top-level result wrapper (so dotted assertion paths rooted at result. resolve), with jsonrpc: "2.0" and an error object that carries code and message. Authoring an assertion against an injected fixture works exactly like asserting against a real failure:

expect:
  - target: "result.error.code"
    matcher:
      exact: -32000
  - target: "result.error.message"
    matcher:
      icontains: "rate limit"
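
For the rate_limited fixture declared earlier, the synthesized envelope would look roughly like the following, rendered here as YAML for readability. This is a sketch built only from the fields the text names; whether the executor adds anything else to the envelope is not specified on this page.

```yaml
# Assumed shape of the envelope synthesized for inject_error: rate_limited.
result:
  jsonrpc: "2.0"
  error:
    code: -32000                               # from fixtures.errors[].code
    message: "GitHub API rate limit exceeded"  # from fixtures.errors[].message
```

Against this shape, the target result.error.code resolves to -32000 and result.error.message resolves to the fixture's message string, which is what the two assertions above check.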

Worked examples ship at examples/named-errors-stdio.yml and examples/named-errors-url.yml.

Compositions

A compositions: block declares a directed acyclic graph (DAG) of tool calls that runs in topological order. Each node names its parents in needs:, and a ${id.field} reference whose id is not in needs: is a load-time error. See compositions.md for the full reference and worked example.

compositions:
  - name: render-top-readme
    nodes:
      - id: search
        tool: search_packages
        args: { query: "mcp" }
      - id: readme
        needs: [search]
        tool: get_readme
        args:
          package: "${search.top_hit.name}"
    output: readme
    expect:
      - target: "nodes.readme.status"
        matcher:
          exact: "ok"

Assertion targets are addressable: composition.ran, composition.order, nodes.<id>.status, nodes.<id>.output, nodes.<id>.duration_ms, and output.* (the combined output from the output: node, or the last node in topological order when output: is omitted).
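
A variant of the composition above illustrates the needs: rule and the output: fallback together. It is a sketch that reuses the same tool names; the assumption that nodes.search.status also reports "ok" on success mirrors the readme assertion in the original example.

```yaml
compositions:
  - name: render-top-readme
    nodes:
      - id: search
        tool: search_packages
        args: { query: "mcp" }
      - id: readme
        # Removing this needs: line would turn the ${search.top_hit.name}
        # reference below into a load-time error, because search would no
        # longer be a declared parent of readme.
        needs: [search]
        tool: get_readme
        args:
          package: "${search.top_hit.name}"
    # output: is omitted here, so output.* refers to the last node in
    # topological order, which is readme in this graph.
    expect:
      - target: "nodes.search.status"
        matcher:
          exact: "ok"
      - target: "nodes.readme.status"
        matcher:
          exact: "ok"
```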

Cassette references

Cassettes record live MCP traffic to disk so a test can replay it deterministically later. The v1 cassette implementation is in flight in the mcptest-cassette crate. The YAML surface (a cassette: field on a tool test that points to a captured frame set) will be documented in a separate reference once the implementation stabilizes.

Until cassettes ship, run against the live server. The schema does not currently validate a cassette: key, so do not add one yet: the loader will reject it as unknown.

Putting it all together

A larger example exercising most of the v1 surface:

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json

servers:
  filesystem:
    command: ["npx", "-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
    env:
      LOG_LEVEL: "info"

  remote_api:
    url: "https://mcp.example.com/v1"
    auth:
      bearer_token_env: "MCPTEST_REMOTE_API_TOKEN"
    wait_for_ready: "https://mcp.example.com/healthz"

variables:
  fixture_path:
    value: "/tmp/mcptest-fixture.txt"
  account_id:
    from_env: "MCPTEST_ACCOUNT_ID"
    default: "acct_demo"

tools:
  - name: "lists tools without error"
    server: filesystem
    tool: "list_directory"
    args:
      path: "/tmp"
    expect:
      - target: "result.content"
        matcher:
          schema:
            type: array
            minItems: 1
        message: "directory listing should not be empty"

  - name: "reads a known fixture"
    server: filesystem
    tool: "read_file"
    args:
      path: "${fixture_path}"
    expect:
      - target: "result.content[0].text"
        matcher:
          contains: "hello"

  - name: "echoes the input back"
    server: remote_api
    tool: "echo"
    args:
      message: "ping from ${account_id}"
    expect:
      - target: "result.content[0].text"
        matcher:
          regex: "^ping from acct_"

compliance:
  - name: "negotiates capabilities on initialize"
    server: filesystem
    check: "initialize"

  - name: "advertises required tools"
    server: remote_api
    check: "tools/list"
    expect:
      - target: "result.tools"
        matcher:
          schema:
            type: array
            minItems: 1

evals:
  - name: "summary stays on topic"
    server: remote_api
    prompt: "Summarize the latest deployment."
    rubric: "Answer must mention the service name and the release tag."
    threshold: 0.7

model_compatibility:
  - "claude-opus-4-7"
  - "claude-sonnet-4-5"

performance:
  default_timeout_ms: 15000
  p95_latency_ms: 2000

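This file:

- Declares two servers: filesystem, a stdio server launched through npx, and remote_api, a URL server that authenticates with a bearer token read from MCPTEST_REMOTE_API_TOKEN.
- Defines two variables: fixture_path with a literal value, and account_id read from MCPTEST_ACCOUNT_ID with a default of acct_demo.
- Runs three tool tests that exercise the schema, contains, and regex matchers.
- Runs two compliance checks, initialize and tools/list.
- Runs one eval with a rubric and a threshold of 0.7.
- Lists two models under model_compatibility and sets default_timeout_ms and p95_latency_ms under performance.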

Run it with:

mcptest validate path/to/file.yml
mcptest run path/to/file.yml

validate runs schema validation only. run validates first, then executes every test in the file.

Field index

A quick cross-reference, alphabetical by full path.

| Path | Type | Required | Default | Section |
| --- | --- | --- | --- | --- |
| compliance[].check | string | yes | - | compliance |
| compliance[].expect[] | array | no | [] | expect |
| compliance[].name | string | yes | - | compliance |
| compliance[].server | string | yes | - | compliance |
| evals[].name | string | yes | - | evals |
| evals[].prompt | string | yes | - | evals |
| evals[].rubric | string | no | - | evals |
| evals[].server | string | yes | - | evals |
| evals[].threshold | number | no | 0.7 | evals |
| fixtures.server | string | no | - | Named error scenarios |
| fixtures.errors[].name | string | yes | - | Named error scenarios |
| fixtures.errors[].code | integer | yes | - | Named error scenarios |
| fixtures.errors[].message | string | yes | - | Named error scenarios |
| fixtures.errors[].tool | string | yes (one of) | - | Named error scenarios |
| fixtures.errors[].applies_to | "any" | yes (one of) | - | Named error scenarios |
| imports[] | string | no | [] | imports |
| model_compatibility[] | string | no | [] | model_compatibility |
| performance.default_timeout_ms | integer | no | - | performance |
| performance.p95_latency_ms | integer | no | - | performance |
| servers.<name>.command | array | yes (stdio) | - | servers |
| servers.<name>.env | object | no | {} | servers |
| servers.<name>.url | string | yes (url) | - | servers |
| servers.<name>.auth.bearer_token_env | string | yes (bearer) | - | auth |
| servers.<name>.auth.oauth.client_id_env | string | yes (oauth) | - | auth |
| servers.<name>.auth.oauth.authorization_url | string | yes (oauth) | - | auth |
| servers.<name>.auth.oauth.token_url | string | yes (oauth) | - | auth |
| servers.<name>.auth.oauth.scopes | array | no | [] | auth |
| servers.<name>.wait_for_ready | string | no | - | servers |
| tools[].args | object | no | {} | tools |
| tools[].expect[] | array | no | [] | expect |
| tools[].expect[].matcher.contains | any | yes (one of) | - | contains |
| tools[].expect[].matcher.exact | any | yes (one of) | - | exact |
| tools[].expect[].matcher.llm-judge | object | yes (one of) | - | llm-judge |
| tools[].expect[].matcher.regex | string | yes (one of) | - | regex |
| tools[].expect[].matcher.schema | object | yes (one of) | - | schema |
| tools[].expect[].matcher.snapshot | string or object | yes (one of) | - | snapshot |
| tools[].expect[].message | string | no | - | expect |
| tools[].expect[].target | string | yes | - | expect |
| tools[].inject_error | string | no | - | Named error scenarios |
| tools[].name | string | yes | - | tools |
| tools[].server | string | yes | - | tools |
| tools[].timeout_ms | integer | no | - | tools |
| tools[].tool | string | yes | - | tools |
| variables.<name>.default | string | no | - | variables |
| variables.<name>.from_env | string | no (one of) | - | variables |
| variables.<name>.value | scalar | no (one of) | - | variables |

A field marked "yes (one of)" is required only when its sibling fields are absent, per the oneOf constraints in the schema.

Deferred and in-flight summary

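For quick reference, the status of every field or feature this page flags as deferred, in flight, or accepted by the schema while the runner catches up, plus the named-error features that have shipped: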

| Field or feature | Status |
| --- | --- |
| servers.<name>.auth.headers | deferred to a future release |
| servers.<name>.auth.mtls | deferred to a future release |
| servers.<name>.headers | schema accepts, runner wires up |
| servers.<name>.http | schema accepts, runner wires up |
| servers.<name>.http.mtls | deferred to a future release |
| tools[].data (parametric inputs) | in flight |
| tools[].expect[].max_duration_ms | schema accepts, runner wires up |
| tools[].expect[].max_response_tokens | schema accepts, runner wires up |
| tools[].expect[].response_headers | schema accepts, runner wires up |
| tools[].expect[].response_headers_absent | schema accepts, runner wires up |
| tools[].profile | in flight |
| tools[].cassette | in flight (mcptest-cassette crate) |
| fixtures.errors[] model | shipped |
| tools[].inject_error injection | shipped |
| Compliance corpus expansion | in flight |
| Container-managed servers | a future release |
| Full eval grader | in flight |
| Baseline expected-failures file | in flight |

If you find a field this page does not cover that the schema does validate, that is a doc bug; file an issue against the mcptest project.

See also