YAML test format reference (v1)

A complete walk-through of every field mcptest reads from a test YAML file. The JSON Schema at schemas/v1.json is the source of truth. This page documents the v1 surface the schema actually validates today and flags the fields that are in flight or deferred so you can plan around them without guessing.

Overview

An mcptest configuration is a single YAML file (or a small tree of files joined via imports) that describes:

Which MCP servers to talk to.
Which tool calls, compliance checks, and evaluations to run.
How to judge the responses.

Every file should start with the YAML language server directive so editors that understand JSON Schema can autocomplete fields, surface inline help, and flag typos before you run the test:

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json

The CLI runs the same schema at load time. mcptest validate <file> loads the YAML, applies schemas/v1.json via the jsonschema crate, and prints errors with JSON pointers into the offending fields. The loader rejects unknown keys at the top level and inside every nested object that uses additionalProperties: false in the schema, so a typo in varables: fails the run instead of silently doing nothing.

A minimal file looks like this:

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json

servers:
  local:
    command: ["./target/debug/my-mcp-server"]

tools:
  - name: "lists tools without error"
    server: local
    tool: "list_directory"
    args:
      path: "/tmp"
    expect:
      - target: "result.content"
        matcher:
          schema:
            type: array
            minItems: 1

The rest of this page walks each block top down. Every field entry lists type, whether it is required, the default, a short snippet, and links to related sections.

Top-level keys

Key	Required	Purpose
`servers`	yes	Named MCP servers under test. See `servers` block.
`imports`	no	Other YAML files to merge in. See `imports` block.
`variables`	no	Author-defined variables for `${name}` interpolation. See `variables` block.
`tools`	no	Array of tool-call tests. See `tools` block.
`resources`	no	Array of resource-read tests. See `resources` block.
`prompts`	no	Array of prompt-get tests. See `prompts` block.
`agents`	no	Agent end-to-end tests against one or more models. See `agents` block.
`faults`	no	Named fault injections an agent test references via `inject:`. See `faults` block.
`providers`	no	Custom OpenAI-compatible provider definitions. See `providers` block.
`budget`	no	Per-test and per-suite spend caps for agent runs. See `budget` block.
`compliance`	no	Protocol-level checks against the server. See `compliance` block.
`evals`	no	Rubric or model-graded evaluations. See `evals` block.
`rubrics`	no	Reusable named rubrics referenced from evals via `rubric: { ref: <name> }`. See `evals` block.
`model_compatibility`	no	Metadata list of model identifiers the suite targets. See `model_compatibility` block.
`performance`	no	Per-test and suite-wide latency budgets. See `performance` block.
`target_versions`	no	Run the suite against more than one MCP protocol revision. See `target_versions` block.

Anything else at the top level fails validation. If you need a custom field for a future feature, file an issue before adding it.

`target_versions` block

Type: array of protocol-revision strings. Optional. .

A suite that lists target_versions: declares that it wants to run against every listed MCP revision in turn, not just the runner's default. A server in a deprecation window can be validated against both the old and the new contracts in a single command, and the reporter aggregates per-version pass/fail.

target_versions:
  - "2025-11-25"
  - "2026-07-28"

servers:
  fs:
    command: ["./fs-server"]

The accepted values are the wire strings listed in docs/stateless-transport.md: 2024-11-05, 2025-03-26, 2025-06-18, 2025-11-25, 2026-03-26, 2026-07-28. An unknown value is a load-time error so a typo never silently selects a session-based fallback. Duplicates are de-duped while preserving declaration order so the reporter prints per-version results in the order the suite listed them.

CLI override. mcptest run --target-version <V> selects a single entry from the suite's list, so a CI matrix can keep one job per protocol revision without sharding the suite. The override must parse via ProtocolVersion::from_wire and, when the suite declares a list, must be one of its entries; mismatched overrides fail with a pointer at the declared list. When the suite declares no list, the override still works as a one-off "run against this revision only" knob.

Selector library. The pure mcptest_core::target_versions::effective(suite_versions, cli_override) function returns the list the runner walks. The runner integration that calls it (per-version fan-out + per-version report aggregation) is the next focused commit on this work; the selector ships today so the CLI flag and the integration land in isolation.

`servers` block

Type: object, required, at least one entry.

Each key under servers is the name a test block refers to via its server: field. The value is one of three shapes: a subprocess specification (command:), a URL specification (url:), or a cassette specification (cassette:) that replays a recording instead of reaching a live server. Exactly one of the three must be present; the schema enforces this with oneOf.

Subprocess (stdio) servers

Use a subprocess server when the MCP server is a local binary, an npx package, or anything that speaks MCP over stdio.

servers:
  filesystem:
    command: ["npx", "-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
    env:
      LOG_LEVEL: "info"

Fields:

command (array of strings, required, no default). The argv to spawn. First element is the executable; remaining elements are arguments. Must contain at least one entry.
env (object of string to string, optional, default {}). Extra environment variables passed to the child process. Values may reference parent env via ${VAR} interpolation (see Variable interpolation).

Snippet for a Rust binary built in the workspace:

servers:
  local:
    command: ["./target/debug/my-mcp-server", "--config", "./fixtures/test.toml"]
    env:
      RUST_LOG: "debug"

CLI server-target overrides

Four global CLI flags rewrite the server: block at runtime so the same YAML can target a freshly-deployed preview or CI environment whose URL or command is not knowable at authoring time. The flags live on the global parser, so they go before the subcommand: mcptest --server-url https://preview run.

Flag	Effect
`--server-url <URL>`	Replace `server.url`. If the YAML had a `command:`, it is removed. Mutually exclusive with `--server-command`.
`--server-command <CMD>`	Replace `server.command`. `<CMD>` is split with POSIX shell rules (`shell-words`), so `--server-command "./dev --debug"` parses to `["./dev", "--debug"]`. Mutually exclusive with `--server-url`.
`--server-auth-bearer-env <NAME>`	Set `server.auth.bearer_token_env` for URL targets.
`--server-config <PATH>`	Load a YAML file containing a full `server:` block and use it in place of the in-suite block.

Precedence (lowest to highest):

The YAML file's own server: block.
--server-config (full-block replacement).
--server-url, --server-command, --server-auth-bearer-env (single-field overrides). A single-field flag wins over the same field in --server-config.

When any override flag is set, the runner prints a one-line banner naming the flags before it runs so the operator can spot a misconfigured CI job without re-reading the YAML.

URL (HTTP or SSE) servers

Use a URL server when the MCP server is already running, lives in a container, or is reachable over the network.

servers:
  remote_api:
    url: "https://mcp.example.com/v1"
    auth:
      bearer_token_env: "MCPTEST_REMOTE_API_TOKEN"
    wait_for_ready: "https://mcp.example.com/healthz"

Fields:

url (string, required, no default). Full URL to the MCP endpoint. Must parse as a URI. Scheme is http(s) for HTTP transport or http(s) with SSE content negotiation.
auth (object, optional, no default). How to authenticate. See auth block.
wait_for_ready (string, optional, no default). URL or path the runner polls until it returns a 2xx response before sending MCP requests. Use this to coordinate with slow-starting containers.

The headers and mtls keys are reserved on the auth object for a future release and later. They are not validated against any shape in v1, and the loader prints a warning if it sees them. See the deferred notes in auth block.

Container-managed servers are documented briefly only. Full container support (lifecycle, networking, healthchecks) is a future release. Until then, run your container yourself and point a URL server at it.

Cassette (replay) servers

Use a cassette server to replay a recorded MCP server instead of reaching a live one. The run never touches the network: every request is matched against the recorded exchanges and the recorded response is returned. This is the offline, deterministic, key-free path for CI.

servers:
  recorded:
    cassette: ./cassettes/issues-server.json

Fields:

cassette (string, required). Path to the cassette JSON file, resolved relative to the working directory. ${VAR} interpolation is supported.

cassette: is mutually exclusive with command: and url:; setting more than one is rejected at load time. See Cassettes for the file format and how matching works.

`auth` block

Authentication strategy for URL servers. Exactly one of bearer_token_env or oauth must be present in v1.

Bearer token from env

servers:
  remote_api:
    url: "https://mcp.example.com/v1"
    auth:
      bearer_token_env: "MCPTEST_REMOTE_API_TOKEN"

Fields:

bearer_token_env (string, required, no default). Name of the environment variable that holds the bearer token. mcptest sends Authorization: Bearer ${value} where value is read from that env var at request time. The token never appears in the YAML file.

Note the distinction between bearer_token_env: NAME and ${NAME}. The former tells mcptest "read this env var into the Authorization header." The latter is generic string interpolation usable in any string field. Use bearer_token_env for auth credentials; the loader applies a redaction pass to those values in reporter output. Use ${NAME} for everything else. This is covered again in Variable interpolation.

OAuth 2.0 client credentials

servers:
  remote_api:
    url: "https://mcp.example.com/v1"
    auth:
      oauth:
        client_id_env: "MCPTEST_OAUTH_CLIENT_ID"
        authorization_url: "https://auth.example.com/oauth/authorize"
        token_url: "https://auth.example.com/oauth/token"
        scopes:
          - "mcp:read"
          - "mcp:invoke"

Fields under oauth:

client_id_env (string, required, no default). Env var holding the OAuth client id. The matching client secret is read from ${client_id_env}_SECRET by convention (so a client_id_env of MCPTEST_OAUTH_CLIENT_ID resolves the secret from MCPTEST_OAUTH_CLIENT_ID_SECRET).
authorization_url (string, required, no default). Authorization endpoint per RFC 6749 section 3.1.
token_url (string, required, no default). Token endpoint per RFC 6749 section 3.2.
scopes (array of non-empty strings, optional, default []). Scopes to request. Omit when the authorization server has a default scope.

Deferred auth fields

headers on auth (object of string to string, deferred to a future release). Will accept a map of header names to env-backed values. The schema is permissive so configs that opt in early do not break loading. The loader logs a warning today. Custom headers for non-credential use cases live on the URL server block as server.headers, see Custom headers and HTTP transport.
mtls on auth (object, deferred to a future release). Will accept paths to client cert and key plus an optional CA bundle. Same permissive shape, same warning. Server-side TLS knobs (insecure_skip_verify, custom CA bundle, minimum protocol version) ship today under server.http.tls.

Custom headers and HTTP transport

URL servers accept two optional blocks that configure the HTTP transport without touching auth:. The server.headers block sends custom headers on every request, and the server.http block tunes timeouts, redirect handling, and basic TLS knobs. mTLS (client certificates) is deferred to a future release; the model carries verify on/off, CA bundle, and minimum protocol version only.

`server.headers`

Type: object, optional, default {}.

Each entry maps a header name to a value. The value is either a literal string (with ${VAR} interpolation against variables) or an object {env: NAME} that the runner reads from the environment at connect time. Pick the env form for secret-flavored headers so they never land in the YAML file.

servers:
  saas_api:
    url: "https://api.example.com/mcp"
    headers:
      X-Tenant: "acme"
      X-API-Key:
        env: ACME_TENANT_API_KEY
      X-Trace-Id: "${request_id}"

Fields per entry:

Literal string (string, no nested fields). Sent verbatim on the wire after ${VAR} interpolation runs.
Object {env: NAME} (one key env, required). Reads NAME from the environment at connect time. Fails the run with a clear error if NAME is unset.

Authorization and Proxy-Authorization are rejected here. Use the auth: block for bearer tokens and OAuth, see auth block. Header names must be valid RFC 7230 token characters (letters, digits, and the punctuation set ! # $ % & ' * + - . ^ _ | ~`). Names with spaces or colons fail validation.

`server.http`

Type: object, optional, default {}.

Controls HTTP transport behavior for URL servers. Stdio servers ignore the block. Duration fields take a number with a suffix: 30s, 500ms, 1m, 1h. Bare numbers without a unit are rejected so a future reader can tell what the value means.

servers:
  saas_api:
    url: "https://api.example.com/mcp"
    http:
      timeout: 30s
      connect_timeout: 5s
      max_redirects: 5
      user_agent_override: true
      tls:
        insecure_skip_verify: false
        ca_bundle_path: "/etc/ssl/internal-ca.pem"
        min_version: "1.2"

Fields:

timeout (duration string, optional, default 30s). Per-request timeout. The runner aborts the request when the body has not fully arrived in this window.
connect_timeout (duration string, optional, default 5s). TCP and TLS connect timeout. Fires before timeout does on a stuck connection.
max_redirects (integer 0 to 50, optional, default 5). Maximum number of redirects the runner follows on a single request. Set to 0 to disable redirects.
user_agent_override (boolean, optional, default true). When true, the runner sets a User-Agent header identifying mcptest. Disable when an upstream gateway gates on a specific user-agent string.
tls (object, optional, default {}). Basic TLS knobs, documented next.

Fields under tls:

insecure_skip_verify (boolean, optional, default false). Skip certificate verification. Dangerous; only use against a private staging endpoint with a self-signed certificate. The CLI flag --insecure-skip-verify sets this for the whole run and prints a WARNING banner on stderr.
ca_bundle_path (string, optional, no default). Path to a PEM-encoded CA bundle. Use when an internal certificate authority issues your endpoint's certificate.
min_version ("1.2" or "1.3", optional, default "1.2"). Minimum TLS protocol version. The runner refuses to negotiate below this floor.

CLI flags for headers and HTTP transport

Five global CLI flags map to the same surface so the same YAML can target a preview environment without editing the file. All flags are repeatable where applicable.

Flag	Effect
`--header NAME=VALUE`	Append a literal header. Repeatable. Rejects `Authorization` and `Proxy-Authorization`.
`--header-env NAME=VAR`	Append an env-backed header. Repeatable. Reads `VAR` from the environment at connect time.
`--insecure-skip-verify`	Disable TLS verification (loud WARNING banner).
`--ca-bundle PATH`	Path to a PEM-encoded CA bundle.
`--http-timeout SECONDS`	Override `server.http.timeout`.
`--connect-timeout SECONDS`	Override `server.http.connect_timeout`.

Precedence (lowest to highest):

server.headers and server.http from the YAML file.
--header, --header-env, --ca-bundle, --http-timeout, --connect-timeout, --insecure-skip-verify. CLI flags overwrite YAML values per header name (case-insensitive).

Worked examples

Cloudflare Access:

servers:
  protected:
    url: "https://mcp.internal.example.com/v1"
    headers:
      CF-Access-Client-Id:
        env: CF_ACCESS_CLIENT_ID
      CF-Access-Client-Secret:
        env: CF_ACCESS_CLIENT_SECRET

Google Cloud IAP:

servers:
  gcp:
    url: "https://iap.example.com/mcp"
    headers:
      Proxy-Authorization-IAP:
        env: GCP_IAP_TOKEN
    http:
      timeout: 60s

AWS API Gateway with an API key:

servers:
  awsg:
    url: "https://abc123.execute-api.us-east-1.amazonaws.com/prod/mcp"
    headers:
      x-api-key:
        env: AWSG_API_KEY

Multi-tenant SaaS with a tenant header:

servers:
  saas:
    url: "https://api.example.com/mcp"
    headers:
      X-Tenant: "acme"
      X-Plan: "enterprise"

Distributed tracing:

servers:
  traced:
    url: "https://api.example.com/mcp"
    headers:
      X-Trace-Id: "${run_id}"
      traceparent: "${otel_traceparent}"

Internal CA bundle (corporate intranet endpoint):

servers:
  internal:
    url: "https://mcp.corp.example.com/v1"
    http:
      tls:
        ca_bundle_path: "/etc/ssl/corp-internal-ca.pem"
        min_version: "1.3"

mTLS (client certificates) is deferred to a future release; this section will gain an mtls: block alongside tls: when that work lands.

`variables` block

Type: object, optional, default {}.

variables holds author-defined values usable inside any string field via ${name} interpolation. Each entry is either a literal value or a reference from_env, never both.

variables:
  fixture_path:
    value: "/tmp/mcptest-fixture.txt"

  account_id:
    from_env: "MCPTEST_ACCOUNT_ID"
    default: "acct_demo"

Fields per entry:

value (string, number, or boolean, optional, no default). Literal value. Use this for paths, identifiers, and other static data.
from_env (string, optional, no default). Name of an environment variable to read. Mutually exclusive with value.
default (string, optional, no default). Fallback value when from_env is unset. Only meaningful alongside from_env.

Resolution precedence is documented at mcptest.sh/docs/secrets-and-variables. The short version, in order from highest to lowest precedence:

Process environment variables (export VAR=... or whatever your shell or CI runner set up).
Values written to a loaded dotenv file (.env next to the test file, or a file passed via --env-file).
The default: field on a from_env variable.
The value: field on a literal variable.

Variables resolve once when the configuration loads. References that fail to resolve (env var unset, no default) raise a structured error before any test runs.

variables:
  api_base:
    value: "https://staging.example.com"

  auth_token:
    from_env: "MCPTEST_TOKEN"
    default: "dev-only-token"

tools:
  - name: "fetches a resource"
    server: remote_api
    tool: "fetch"
    args:
      url: "${api_base}/resources/42"

`tools` block

Type: array of objects, optional, default [].

Each entry is a tool-call test: invoke a single tool on a named server and run assertions against the result.

tools:
  - name: "lists tools without error"
    server: filesystem
    tool: "list_directory"
    args:
      path: "/tmp"
    expect:
      - target: "result.content"
        matcher:
          schema:
            type: array
            minItems: 1
    timeout_ms: 5000

Fields per entry:

name (string, required, no default). Human-readable test name. Appears in reporter output and JUnit summaries.
server (string, required, no default). Key into the top-level servers map. The test fails to load if the name is not registered.
tool (string, required, no default). Tool name the server exposes. The runner checks the name against tools/list when available.
args (object, optional, default {}). JSON-serializable arguments passed to the tool. Values may use ${name} interpolation against variables.
expect (array of assertions, optional, default []). Assertions evaluated against the tool result. See expect block.
timeout_ms (integer, optional, no default). Per-test timeout override in milliseconds. Falls back to performance.default_timeout_ms when unset. See performance block.

Parametric inputs (a data: array that runs the same test once per row) are deferred. The intended shape is one entry per row, with each row substituted into the args block via ${data.field} interpolation. Until that lands, duplicate the test or use a small shell loop.

A worked example using both literal and env-backed variables:

variables:
  greeting:
    value: "hello"
  who:
    from_env: "MCPTEST_GREET_WHO"
    default: "world"

tools:
  - name: "echoes a greeting"
    server: local
    tool: "echo"
    args:
      message: "${greeting}, ${who}"
    expect:
      - target: "result.content[0].text"
        matcher:
          exact: "hello, world"
        message: "echo should round-trip the rendered message"

`tags` (optional, on every test type)

Type: array of strings. Optional. Available on tool tests, resource tests, prompt tests, and agent tests.

tools:
  - name: search returns at least one result
    server: prod-api
    tool: search
    args:
      query: "weather"
    tags: ["smoke", "search"]
    expect:
      - target: "result.content"
        matcher:
          contains: "weather"

mcptest run --tag smoke keeps tests whose tag list contains smoke. mcptest run --skip-tag slow drops tests whose tag list contains slow. Multiple --tag flags are OR'd; --skip-tag wins when a test matches both.

For agent tests the tag applies to every (agent, model) row the matrix expands to, so --tag smoke keeps or drops every model under that agent as a unit. See cli-reference.md for the full flag matrix.

`transform` (optional, on tool tests)

Type: object with optional request and response command strings. Optional. Tool tests only.

A transform is a subprocess that rewrites the outbound request before it is sent, the response before assertions run, or both. The command reads one JSON value on stdin and writes the replacement value on stdout, so plain jq, node, or python work without any framing protocol.

tools:
  - name: search with normalized response
    server: prod-api
    tool: search
    args:
      query: "weather"
    transform:
      request: jq '.arguments.query |= ascii_downcase'
      response: ./transforms/strip-ids.sh
    expect:
      - target: result.content[0].text
        matcher:
          icontains: weather

The request command receives { "name": ..., "arguments": ... }; the response command receives { "result": ... }. A non-zero exit, a timeout, or unparseable stdout fails the test. You can set a default transform under defaultTest: and override it per test. See transforms.md for the full contract, the environment context, and worked examples in jq, Node, and Python.

`resources` block

Type: array of objects, optional, default [].

Each entry is a resource-read test: read a single resource by URI from a named server (resources/read) and run assertions against the result envelope.

resources:
  - name: "readme resource is text"
    server: filesystem
    uri: "file:///workspace/README.md"
    expect:
      - target: "result.contents[0].mimeType"
        matcher:
          icontains: "text"

Fields per entry:

name (string, required). Human-readable test name.
server (string, required). Key into the top-level servers map.
uri (string, required). Resource URI to read via resources/read.
expect (array, optional). Assertions against the result. The result shape is { "contents": [{ "uri", "mimeType"?, "text" }] }, so targets read like result.contents[0].text.
tags, threshold, and derivedMetrics work the same as on tool tests.

`prompts` block

Type: array of objects, optional, default [].

Each entry is a prompt-get test: fetch a single prompt by name from a named server (prompts/get) and run assertions against the result envelope.

prompts:
  - name: "bug-triage prompt renders"
    server: filesystem
    prompt: bug_triage
    args:
      severity: high
    expect:
      - target: "result.messages[0].content.text"
        matcher:
          icontains: "triage"

Fields per entry:

name (string, required). Human-readable test name.
server (string, required). Key into the top-level servers map.
prompt (string, required). Prompt name to fetch via prompts/get.
args (object, optional). JSON-serializable arguments passed to the prompt.
expect (array, optional). Assertions against the result. The result shape is { "messages": [{ "role", "content": { "type", "text" } }] }, so targets read like result.messages[0].content.text.
tags, threshold, and derivedMetrics work the same as on tool tests.

Both run against any MCP server, including the built-in mcptest mock, which serves resources/* and prompts/* from its manifest.

`agents` block

Type: array, optional. Each entry runs a real LLM against one or more MCP servers and asserts against the resulting conversation trace. Background and design rationale live in docs/models.md and docs/concepts.md.

agents:
  - name: weather query routes to get_weather
    # Either `model:` (singleton) or `models:` (matrix).
    models:
      - claude-sonnet-4-5                    # auto-detect family
      - { provider: openrouter, id: openai/gpt-4o }   # named provider
    servers: [weather]                                # one or more MCP servers
    prompt: What is the weather in Sacramento?
    system_prompt: |                                   # optional
      You are a weather assistant.
    max_turns: 10                                      # optional, default 10
    max_tokens: 1024                                   # optional, default 1024 (per call)
    token_budget: 50000                                # optional, cumulative tokens across the run
    max_tool_calls: 20                                 # optional, total tool calls across the run
    time_budget: 30s                                   # optional, wall-clock deadline for the run
    expect:
      - target: tool_calls[0].name
        matcher: { exact: get_weather }
      - target: tool_names                             # the whole trajectory, in order
        matcher: { contains-all: [get_weather] }
      - target: redundant_tool_calls                   # no repeated (name, args) calls
        matcher: { exact: 0 }
      - target: conversation.tokens.total
        matcher: { regex: "^[0-9]+$" }

Per-entry fields:

Field	Required	Purpose
`name`	yes	Test name shown in the reporter.
`model` or `models`	yes	Exactly one. `model` is sugar for `models: [<value>]`. Each entry under `models:` is either a bare string (auto-detected family) or `{ provider: <name>, id: <model> }` referencing a `providers:` entry.
`servers`	yes	At least one server key from the top-level `servers:` map. The driver exposes the merged tool catalog from every named server to the model.
`prompt`	yes*	User prompt that kicks off the conversation. *Optional when a `turns:` block supplies the user side (see Multi-turn conversations).
`turns`	no	Drive a multi-turn conversation instead of a single `prompt`: a scripted list of `{ user: ... }` turns, or a `{ simulate: ... }` block where an LLM plays the user. See Multi-turn conversations.
`system_prompt`	no	Optional system prompt.
`max_turns`	no	Hard cap on tool-use iterations. Defaults to 10.
`max_tokens`	no	Per-call `max_tokens` for the model. Defaults to 1024.
`token_budget`	no	Cumulative token budget across the whole run. Distinct from per-call `max_tokens`: this sums every turn's usage and stops the run once the total crosses the cap. Omit for an unbounded run.
`max_tool_calls`	no	Cap on total tool invocations across the run. The run stops once the count crosses the cap. Omit for an unbounded run.
`time_budget`	no	Wall-clock deadline for the run as a human duration string (`30s`, `2m`, `1h`, `500ms`). The run stops at the next loop boundary once the deadline passes. Omit for an unbounded run.
`inject`	no	Names of top-level `faults` to inject into this run. A call matching an injected fault is synthesized as an unresponsive timeout.
`recovery`	no	Recovery gate scored against the injected fault: `max_detection_ms` (required), `max_recovery_ms` (optional), `require_clean_timeout` (optional). See `faults` block.
`expect`	no	Assertions evaluated against the trace envelope (see below).

The run-wide caps (token_budget, max_tool_calls, time_budget) all default to off, so a suite that omits them runs exactly as before. When a cap trips, the run stops with a typed error that names the cap and its limit, and the trace records the stopping cap on conversation.stop_reason (one of completed, turn_budget, token_budget, tool_call_budget, time_budget).

Matcher target grammar for agent runs:

Target	What it resolves to
`tool_calls[i].name`	Bare tool name the model picked (the `<server>__` prefix is stripped before the trace is recorded).
`tool_calls[i].server`	MCP server the call routed to. Always set, even on single-server runs.
`tool_calls[i].args.<path>`	Arbitrary JSON path into the arguments the model passed.
`tool_calls[i].hop_index`	Zero-based position of the call in the run. Assert a tool was called at a specific step, not just "at some point".
`tool_calls[i].agent_id`	Which agent made the call, in multi-agent runs. Absent (single-agent) keeps the legacy shape.
`tool_calls[i].inputs_digest`	Stable fingerprint of the context the agent had when it made the call. Assert it is unchanged across runs for a determinism check.
`tool_calls.length`	Total number of tool calls in the run.
`tool_names`	Ordered array of the tool names called, for asserting the whole trajectory in one matcher: `exact: [search, fetch]` pins the sequence, `contains-all: [search]` requires a tool, `not: { contains: danger }` forbids one.
`redundant_tool_calls`	Count of tool calls that repeat an identical (name, args) pair already made earlier in the run. A deterministic backtracking / efficiency signal: assert `exact: 0` for a clean plan, or a ceiling.
`tool_results[i].isError`	Whether the server flagged the result as an error.
`tool_results[i].server`	Mirror of the matching `tool_calls[i].server`.
`tool_results[i].result.<path>`	Arbitrary JSON path into the server's response.
`final_response`	The model's last plain-text reply. Empty when the loop ran out of turns.
`conversation.tokens.total`, `conversation.tokens.prompt`, `conversation.tokens.completion`	Cumulative token usage reported by the provider.
`conversation.duration_ms`	Wall-clock duration of the agent loop.
`conversation.message_count`	Number of user / assistant / tool messages exchanged.
`conversation.stop_reason`	Why the run stopped: `completed`, `turn_budget`, `token_budget`, `tool_call_budget`, or `time_budget`.
`conversation.per_turn[i].user`, `.final_response`, `.tool_calls[j].name`, ...	Per-user-turn breakdown for a multi-turn run (see below). Empty on a single-`prompt` run.

Multi-turn conversations

By default an agent test drives one prompt through one tool-using loop. A turns: block drives a multi-turn conversation instead, carrying the conversation and tool state across turns. The top-level trace still aggregates every turn (so tool_calls, final_response, and the single-turn metrics resolve over the whole conversation), and each turn's slice is recorded under conversation.per_turn[i].

Scripted: an ordered list of user messages the driver replays.

agents:
  - name: multi-step trip planning
    model: claude-sonnet-4-5
    servers: [flights, hotels]
    turns:
      - user: "Find a flight SFO to JFK next Friday"
      - user: "Now book a hotel near JFK for that night"
    eval:
      multi_turn_mcp_use: { threshold: 0.6 }   # averages per-turn MCP use

Simulated: an LLM plays the user, generating each turn from a goal, an optional persona, and the conversation so far, until it answers DONE or hits max_turns (default 6). The simulator shares the agent's model, so with no API key it falls back to the deterministic stub and CI stays green.

    turns:
      simulate:
        goal: "Book a same-day round trip and add it to my calendar"
        persona: "terse traveler who changes their mind once"
        max_turns: 6

With a turns: block, prompt is optional. Each turn's slice exposes conversation.per_turn[i].user, .final_response, .tool_calls[j], and .tool_results[j] (with call_index local to the turn), so an expect: can assert on a specific turn, e.g. target: conversation.per_turn[1].tool_calls[0].name.

`eval` (judged metrics)

An agent test can carry an eval: block of judged metrics that score the run trace after it completes. Each metric produces its own PASS/FAIL report row (alongside the per-run expect: rows) and gates the exit code. Metrics are judged by the agent's own model, so a run with no provider key defers them (reported, not failed) and CI stays green.

agents:
  - name: books a meeting end to end
    model: claude-sonnet-4-5
    servers: [calendar]
    prompt: "Book a meeting with Alice next Tuesday at 2pm and confirm it."
    eval:
      mcp_task_completion: { threshold: 0.6 }   # did the run accomplish the task?
      mcp_use: { threshold: 0.6 }               # right tools, right arguments?
      argument_correctness: {}                   # tool-call arguments correct?
      plan_quality: {}                           # sensible call sequence?
      rubric:                                     # your own weighted criteria
        threshold: 0.8
        criteria:
          - name: booked the right day
            weight: 2
            description: "The agent created an event on the correct Tuesday."
          - name: confirmed to the user
            description: "The final reply confirms the booking."

Metrics:

mcp_task_completion (default threshold 0.5): did the run accomplish the task in the prompt?
mcp_use (default 0.5): were the right tools called with sensible arguments for the intent?
argument_correctness (default 0.5): are each tool call's arguments correct and complete for the intent?
plan_quality (default 0.5): is the sequence of tool calls a sensible plan (no redundancy, ordering, or gaps)?
multi_turn_mcp_use (default 0.5): MCP-use alignment averaged across a multi-turn conversation. Each conversation.per_turn slice is judged for tool-use quality and the per-turn scores are averaged. Requires a turns: block (see below); on a single-prompt run it defers (reported, not failed).
rubric (default threshold 0.7): your own custom rubric, in one of two forms. The flat form is a weighted, multi-criterion list: each criteria[] entry is name + description (the judge rubric) + an optional weight (default 1.0), and the score is the weight-normalized average of the per-criterion judgments, each reported with its own reason. Add strict: true to require a perfect score. Add require_evidence: true so each criterion's judge must cite a verbatim span from the candidate; a criterion with no evidence scores 0 and gates the eval. A criterion may carry examples: (labeled response + score pairs) that are appended to its judge prompt as few-shot calibration anchors. A criterion may also carry a when: predicate ({ contains: <str> } or { regex: <pattern> }) so it is judged only when the candidate matches, and its own threshold: that overrides the rubric default for that criterion's gate.

A rubric can instead be a decision tree: a branching set of yes/no judgments that ends at a fixed score. Each ask node poses one narrow question; the judge answers it, and the run descends the yes or no branch (a yes is a judge score of 0.5 or higher). A score leaf ends the walk with that score (0..1) and an optional reason. One narrow question per node is easier to judge reliably and to audit than one holistic score, and only the questions reach the model. The report shows the exact path taken.

    eval:
      rubric:
        threshold: 0.7
        tree:
          ask: "Did the agent call the get_weather tool?"
          yes:
            ask: "Does the final reply state a temperature?"
            yes: { score: 1.0, reason: "called the tool and reported a temperature" }
            no:  { score: 0.4, reason: "called the tool but gave no temperature" }
          no:    { score: 0.0, reason: "never called the weather tool" }

A tree-mode failure note records the path, e.g. path: Did the agent call the get_weather tool? -> yes, Does the final reply state a temperature? -> no. Give a rubric either criteria or tree, not both.

Every metric scores 0..1 and gates on its threshold. The score and the judge's reasons appear in the report, with secrets redacted.

Auto-detected provider families (set the env var for each model you want to exercise; missing keys fall back to a deterministic stub so CI stays green):

Family	Detected when model starts with	Env var
Anthropic	`claude-`	`ANTHROPIC_API_KEY`
OpenAI	`gpt-`, `chatgpt-`, `o<digit>`, `text-`, `davinci-`	`OPENAI_API_KEY` (+ optional `OPENAI_ORG_ID`)
Google	`gemini-`, `models/gemini-`	`GEMINI_API_KEY` (or `GOOGLE_API_KEY`)
Mistral	`mistral-`, `codestral-`, `magistral-`, `ministral-`, `devstral-`, `pixtral-`, `open-mistral-`, `open-mixtral-`	`MISTRAL_API_KEY`

For anything else (Azure OpenAI, OpenRouter, vLLM, LiteLLM, Groq, Together, etc.), see providers block.

`faults` block

Type: array, optional. Declares named fault injections an agent test opts into with inject:. A fault makes a tool behave like an unresponsive backend so the agent's recovery path is exercised. The executor synthesizes the fault in virtual time (no real hang), so a recovery test is deterministic and CI-bounded.

Field	Required	Purpose
`name`	yes	Unique fault name an agent test references via `inject:`.
`target`	yes	Which calls the fault applies to: `{ tool: <name> }`, `{ server: <key> }`, or both. At least one is required. A missing `tool` matches any tool; a missing `server` matches any server.
`kind`	yes	One of `hang`, `wedged`, `slow`, `recover_after`.
`delay_ms`	for `slow`	Milliseconds before the call answers. A delay at or beyond the run's `max_detection_ms` reads as unresponsive.
`failures`	for `recover_after`	Number of calls that hang before the server answers normally.

The agent test that injects a fault gates the run with a recovery: block:

Field	Required	Purpose
`max_detection_ms`	yes	Per-call timeout budget. A correctly configured agent gives up on a hung call within this rather than blocking forever.
`max_recovery_ms`	no	Cap on total recovery time. Omit to leave recovery time unbounded (the gate then only requires that the agent recovered at all).
`require_clean_timeout`	no	When true, also require a clean timeout within the detection budget. Defaults to false.

faults:
  - name: hung-search
    target: { tool: search }
    kind: hang
agents:
  - name: recovers from a hung search tool
    model: claude-sonnet-4-5
    servers: [faulty]
    inject: [hung-search]
    prompt: Search for the latest incident report.
    recovery:
      max_detection_ms: 3000
      max_recovery_ms: 5000
      require_clean_timeout: true

A poor-recovery agent (one that loops on the hung tool until its turn budget trips, never replying) fails the run as a quality failure (exit 1). A dangling inject: name, or a connect failure, is an infra error (exit 2). See Fault injection and recovery for the scoring model.

`providers` block

Type: object, optional. Declares custom OpenAI-compatible endpoints that agent models: entries can target by name. The wire shape is the OpenAI Chat Completions API, so any gateway or self-hosted server speaking that shape is reachable through one of these.

providers:
  openrouter:
    type: openai
    base_url: https://openrouter.ai/api/v1
    api_key_env: OPENROUTER_API_KEY
  azure-prod:
    type: openai
    base_url: https://my-resource.openai.azure.com/openai/deployments/my-gpt-5
    api_key_env: AZURE_OPENAI_KEY
    organization: my-azure-org           # optional
  local-vllm:
    type: openai
    base_url: http://localhost:8000/v1
    # api_key_env omitted: unauthenticated local endpoint. The runner
    # sends `Authorization: Bearer EMPTY`, which the common
    # self-hosted servers accept.

agents:
  - name: my test
    models:
      - { provider: openrouter, id: anthropic/claude-3.5-sonnet }
      - { provider: azure-prod, id: my-gpt-5 }
    ...

Per-entry fields:

Field	Required	Purpose
`type`	yes	Wire protocol. Today the only supported value is `openai`.
`base_url`	yes	Endpoint base URL the runner POSTs to.
`api_key_env`	no	Environment variable that holds the bearer token. Omit for unauthenticated local endpoints.
`organization`	no	Optional `OpenAI-Organization` header.

Worked example: agent-custom-providers.yml.

`budget` block

Type: object, optional. Per-test and per-suite spend caps applied to agent runs. Both fields are USD cents; missing fields disable that scope.

budget:
  per_test_usd_cents: 50
  per_suite_usd_cents: 500

Field	Required	Purpose
`per_test_usd_cents`	no	Cap on the dollars one agent test can spend. The counter resets between tests.
`per_suite_usd_cents`	no	Cap on the dollars an entire suite can spend. Useful when a matrix run multiplies the per-test spend across N models.

When a cap trips the agent loop stops with a clear error before the provider issues a surprise bill.

`expect` block (the matchers)

Type: array of assertions, optional, default [].

Each assertion combines a target (where to look in the response) with a matcher (how to judge it) and an optional message (human-friendly description used in reporter output).

expect:
  - target: "result.content[0].text"
    matcher:
      contains: "hello"
    message: "tool should greet by name"

Fields per assertion:

target (string, required, no default). JSONPath-style expression that extracts a value from the response. Example: result.content[0].text.
matcher (object, required, no default). One matcher predicate. See the matchers documented below.
message (string, optional, no default). Friendly description used in failure output. Falls back to the matcher type when omitted.
transform (string, optional, no default). A command that rewrites the extracted value before this matcher compares it. The value at target goes to the command's stdin as JSON and the parsed stdout replaces it, so you can lowercase a string, strip a volatile id, or reshape one field without touching the rest of the response. Works on tool and agent tests. See transforms.md for the contract.

expect:
  - target: "result.name"
    transform: jq ascii_downcase
    matcher:
      exact: "echo"

A matcher object carries exactly one key. The valid keys in v1 are exact, contains, regex, schema, snapshot, llm-judge, llm-jury, contains-all, contains-any, icontains, starts-with, is-json, is-valid-tools-call, levenshtein, is-xml, is-sql, similar, cel, factuality, answer-relevance, context-faithfulness, and not. The matcher selection is mutually exclusive at the schema level (minProperties: 1, maxProperties: 1), so the loader rejects a matcher that tries to combine exact: true with regex: "foo".

Every failed assertion produces a structured failure with the following common fields:

{
  "test_name": "echoes the input back",
  "target": "result.content[0].text",
  "matcher": "regex",
  "message": "echo should round-trip the rendered message",
  "expected": "^ping from acct_",
  "actual": "ping from acct_demo"
}

Reporters render this structure into pretty text, JSON, or JUnit XML. The matcher-specific shape is documented per matcher below.

`exact` matcher

Strict structural equality. Compares the extracted value to the matcher argument using JSON equality rules (objects, arrays, primitives).

expect:
  - target: "result.content[0].text"
    matcher:
      exact: "hello, world"

Failure shape:

{
  "matcher": "exact",
  "expected": "hello, world",
  "actual": "hello, world!",
  "diff": "+!"
}

Use exact when you want a deep-equal match. Reach for contains or regex when the response embeds extra text you do not care about.

`contains` matcher

Containment, matched by the type of the value:

Strings: case-sensitive substring. contains: "Sacramento" passes against "It is rainy in Sacramento.". Use icontains for the case-insensitive form and exact for full equality.
Objects: subset. Every key in the matcher value must be present in the actual value and its value must itself satisfy contains (extra keys in the actual value are ignored).
Arrays: multiset. Every element in the matcher value must have a distinct matching element in the actual array, in any order.

Non-string scalars (numbers, booleans, null) compare by equality.

Object subset example (passes even though the actual result has more keys):

expect:
  - target: "result"
    matcher:
      contains:
        isError: false

Array multiset example (passes when every listed element is present):

expect:
  - target: "result.tags"
    matcher:
      contains: ["urgent", "billing"]

Failure shape (object subset, missing key):

{
  "matcher": "contains",
  "path": "/isError",
  "expected": false,
  "note": "missing key `isError` in actual value"
}

See also: exact, icontains, regex.

`regex` matcher

Tests whether the regex pattern matches anywhere in the stringified value. Uses the regex crate's default syntax.

expect:
  - target: "result.content[0].text"
    matcher:
      regex: "^ping from acct_"

Failure shape:

{
  "matcher": "regex",
  "expected": "^ping from acct_",
  "actual": "pong from svc_demo",
  "match": null
}

When the regex compiles but does not match, match is null. When the regex itself is invalid, the loader fails the entire suite at validation time with a pointer to the offending field.

`schema` matcher

Validates the value against an inline JSON Schema. Useful for asserting the shape of a structured tool response without pinning specific values.

expect:
  - target: "result"
    matcher:
      schema:
        type: object
        required: ["content", "isError"]
        properties:
          content:
            type: array
            minItems: 1
          isError:
            type: boolean

Failure shape:

{
  "matcher": "schema",
  "errors": [
    {
      "instance_path": "/isError",
      "schema_path": "/properties/isError/type",
      "message": "expected boolean, got string"
    }
  ]
}

The schema runs through the same jsonschema crate the loader uses, so draft 2020-12 features such as if/then/else, prefixItems, unevaluatedProperties, oneOf/anyOf/allOf, and $defs with internal $ref work as documented. To assert that a value is an array with at least one element (the role length used to play), use type: array with minItems.

Security limits. Two guards run before validation:

External $ref URIs are refused at compile time. Any $ref value that does not start with # (a same-document fragment) fails the matcher with SchemaExternalRef. The validator never reaches the network.
The schema's JSON nesting depth is capped (default 64). A schema deeper than the cap fails with SchemaTooDeep before compile.
Validation iteration runs under a wall-clock budget (default 2 seconds). Exceeding the budget fails the matcher with SchemaValidationTimedOut.

Real-world MCP schemas sit far under these defaults, so a hand-authored suite never trips them. The limits exist to keep a pathological schema or a malicious server response from stalling the runner.

`snapshot` matcher

Records the extracted value to disk on the first run, then deep-compares the extracted value against the recorded copy on every subsequent run. A mismatch fails the test with a readable diff. The shorthand is a string (the snapshot key); the long form is an object with optional flags.

expect:
  - target: "result.content"
    matcher:
      snapshot: "lists-tools-content"

The key resolves to a file under the snapshots/ directory next to the suite YAML: snapshot: "lists-tools-content" writes and reads <suite_dir>/snapshots/lists-tools-content.json. Keys may be nested (snapshot: "tools/echo/baseline" becomes <suite_dir>/snapshots/tools/echo/baseline.json); intermediate directories are created on first write. A .json extension is appended automatically unless the key already carries an extension.

On the first run (or when the snapshot file is missing) the matcher records the current value and passes. On later runs it loads the recorded value and compares; the test fails when they differ. To re-record after an intentional change, run with --update-snapshots (or -u), which rewrites every snapshot the run touches and passes. Review the resulting git diff before committing.

mcptest run --update-snapshots

--update-snapshots is refused when CI=true is set, so a CI job never silently rewrites a golden file. Pass --allow-update-in-ci to override.

Long form (per-test update override):

expect:
  - target: "result.content"
    matcher:
      snapshot:
        name: "lists-tools-content"
        update: false

name (string). Snapshot key, resolved as above.
update (boolean, default false). When true, this test's snapshot is re-recorded regardless of the global flag. Use sparingly, for a single fast-moving test.

Failure shape:

{
  "matcher": "snapshot",
  "note": "snapshot at snapshots/lists-tools-content.json did not match (run with --update-snapshots to refresh)",
  "expected": "...recorded value...",
  "actual": "...current value..."
}

`llm-judge` matcher

Routes the candidate string through an LLM with a grading rubric. The matcher passes when the judge returns pass: true or a score at or above threshold. The judge model is invoked through the same provider lookup the agent driver uses (env-var auto-detection plus named providers), so any provider family configured for the suite is fair game. The literature grounding is in docs/research-references.md.

expect:
  - target: final_response
    matcher:
      llm-judge:
        rubric: |
          The response must mention Sacramento and at least one
          temperature number, and must not invent details the tool
          did not return.
        threshold: 0.8
        model: claude-sonnet-4-5    # optional override

Fields under llm-judge:

rubric (string, required, no default). Grading rubric handed to the judge model. Multi-line is fine.
threshold (number in [0, 1], optional, default 0.7). When the judge's reply carries a score field, the matcher passes when score >= threshold. The reply can also carry pass: true to short-circuit the score comparison.
model (string, optional). Override the judge model id. When omitted the runner uses the same provider the executor was built with (see the with_provider hook on McpExecutor).

Failure note: the matcher's diff carries the score and the judge's one-sentence reason, so the reporter shows you why the judge said no.

`llm-jury` matcher

Like llm-judge, but runs N independent jurors and requires a quorum to pass. Useful when one judge model is itself the subject of the test (so you do not want it grading its own output) or when the assertion is high-stakes enough to want consensus. Reports inter-juror agreement (Krippendorff's alpha) alongside the verdict so a split jury is obvious.

expect:
  - target: final_response
    matcher:
      llm-jury:
        rubric: |
          Reply pass when the response (1) names Sacramento, (2) cites
          the temperature the tool returned, (3) is no longer than two
          sentences.
        jurors:
          - model: claude-sonnet-4-5
          - model: gpt-5
          - model: gemini-2.5-pro
        threshold: 0.7
        quorum: 0.66

Fields under llm-jury:

rubric (string, required). Same shape as llm-judge.rubric.
jurors (array, required, at least one entry). Each entry has a model: field; an optional per-juror threshold: overrides the jury-wide default.
threshold (number, optional, default 0.7). Default per-juror score threshold; jurors whose score meets it count as a pass.
quorum (number in [0, 1], optional, default 0.5). Fraction of eligible jurors that must pass for the jury verdict to be pass. 1.0 requires unanimous; 0.5 is simple majority.

Jurors that error out (network failure, malformed reply) are recorded as abstaining; they do not count for or against quorum. The reporter shows each juror's verdict and the aggregate pass_fraction.

Worked example: examples/agent-llm-judge.yml.

See also: contains, regex, agents block.

`contains-all` matcher

The string value must contain every listed substring, or the array value must contain every listed element. The argument is an array of needles.

expect:
  - target: "result.content[0].text"
    matcher:
      contains-all: ["order", "shipped", "tracking"]

`contains-any` matcher

The string value must contain at least one listed substring, or the array value must contain at least one listed element. An empty list never passes.

expect:
  - target: "result.content[0].text"
    matcher:
      contains-any: ["approved", "accepted"]

`icontains` matcher

Case-insensitive substring containment. The argument is a single string.

expect:
  - target: "result.content[0].text"
    matcher:
      icontains: "SUCCESS"

`starts-with` matcher

The string value must start with the given prefix.

expect:
  - target: "result.content[0].text"
    matcher:
      starts-with: "data:image/png;base64,"

`is-json` matcher

The string value must parse as JSON. Pass ~ (null) for a parse-only check, or an object with a schema key to also validate the parsed document against an inline JSON Schema (the same engine the schema matcher uses).

expect:
  - target: "result.content[0].text"
    matcher:
      is-json:
        schema:
          type: object
          required: ["id", "status"]

`is-valid-tools-call` matcher

The value must be a well-formed MCP tool call: an object with a string name and an arguments object. Pass ~ (null) for a shape-only check, or an object with a schema key to validate the call's arguments against an inline JSON Schema.

expect:
  - target: "tool_calls[0]"
    matcher:
      is-valid-tools-call:
        schema:
          type: object
          required: ["city"]

`levenshtein` matcher

The Levenshtein edit distance between the stringified value and the reference value must be at most max. Useful for "close enough" text comparisons without an embedding model. The distance counts insertions, deletions, and substitutions over Unicode scalar values.

expect:
  - target: "result.content[0].text"
    matcher:
      levenshtein:
        value: "the quick brown fox"
        max: 3

`is-xml` matcher

The string value must parse as well-formed XML. Pass ~ (null) for a well-formedness check, or an object with a root key to also assert the document's root element name. The value is read to EOF with a streaming parser, so any malformed markup (an unclosed tag, mismatched element) fails the assertion.

expect:
  - target: "result.content[0].text"
    matcher:
      is-xml:
        root: "note"

`is-sql` matcher

The string value must parse as SQL. The value is handed to a real SQL parser using the permissive generic dialect, so it accepts the broadest set of statements rather than pinning one engine's grammar. Useful for a "text to SQL" tool whose output should at least be syntactically valid. The matcher takes no options; pass ~ (null).

expect:
  - target: "result.content[0].text"
    matcher:
      is-sql: ~

`similar` matcher

Embedding cosine similarity to a reference string. The matcher embeds both the reference value and the actual value with the named embedding model, then passes when the cosine similarity is at least threshold (a floor in 0..1). Use it when two answers can be worded differently but mean the same thing, where levenshtein (character distance) is too literal.

Like llm-judge, this matcher calls an external model, so it needs a configured provider with an embeddings endpoint (OpenAI text-embedding-3-*, Google, Mistral, or a local Ollama embedding model). Anthropic has no embeddings API, so pointing similar at it fails with a clear unsupported-feature error. Because it is async, similar is not composable under the not: wrapper.

expect:
  - target: "result.content[0].text"
    matcher:
      similar:
        value: "the order shipped and is on its way"
        threshold: 0.85
        model: "text-embedding-3-small"

`cel` matcher

A CEL (Common Expression Language) boolean predicate over the value. Reach for it when the built-in matchers do not express the check you need and you want a deterministic rule rather than an LLM judge. The value at target is bound into the expression as the variable value, so the predicate reads the resolved target directly. Omit target (or set it to "") to bind the whole response envelope as value.

CEL is the same predicate language SBproxy uses, so one expression dialect carries across the gateway and the tester. It evaluates in-process with no model and no network call, so unlike llm-judge it is deterministic and composes under not:. The regex helper is available, so value.matches("^ok") works. The expression must return a boolean; a parse error, an evaluation error, or a non-boolean result fails the run as a suite bug rather than a failed assertion.

Keep the split with transform: in mind: jq under transform: reshapes data (one value in, one value out), while cel: is a boolean predicate. Do not reach for one where the other fits.

expect:
  - target: "result"
    matcher:
      cel: "value.content.size() > 0 && !value.isError"

`factuality`, `answer-relevance`, `context-faithfulness` matchers

Named model-graded matchers with a shared vocabulary for RAG and QA suites. Each is sugar over llm-judge: it carries one field and desugars to a judge call with a vetted, fixed rubric that embeds that field, so the judge plumbing, threshold gating, and calibration all carry over. Like llm-judge, each needs a configured provider, and each takes an optional threshold (default 0.7 on the judge's 0..1 score) and a model override.

factuality judges whether the value is consistent with a reference.
answer-relevance judges how directly the value answers a query.
context-faithfulness judges whether every claim in the value is grounded in a context (it penalizes hallucination).

expect:
  - target: "result.content[0].text"
    matcher:
      factuality:
        reference: "Paris is the capital of France"
  - target: "result.content[0].text"
    matcher:
      answer-relevance:
        query: "What is the capital of France?"
        threshold: 0.8
  - target: "result.content[0].text"
    matcher:
      context-faithfulness:
        context: "France is a country in Europe. Its capital is Paris."

`not` matcher

Universal negation. Wraps another matcher and passes when that matcher fails (and fails when it passes). One not: wrapper composes over every deterministic matcher, so there is no per-type not-contains or not-regex variant to learn. Negation is limited to deterministic matchers; wrapping a stateful matcher (snapshot) or an async/LLM matcher (llm-judge, llm-jury, similar) is an evaluation error.

expect:
  - target: "result.content[0].text"
    matcher:
      not:
        contains: "error"

`oneOf` / `anyOf` / `allOf` composition matchers

JSON Schema 2020-12 ships three composition keywords. mcptest exposes the same composition at the matcher level so a suite can compose any matchers, not just schemas, across one extracted target.

oneOf: passes when exactly one inner matcher passes.

expect:
  - target: "result.content[0].text"
    matcher:
      oneOf:
        - exact: "ok"
        - exact: "ready"

anyOf: passes when at least one inner matcher passes.

expect:
  - target: "result.content[0].text"
    matcher:
      anyOf:
        - contains: "ok"
        - contains: "ready"
        - regex: "^accepted-\\d+$"

allOf: passes when every inner matcher passes. Useful for combining a positive predicate with a negative one without two separate assertions.

expect:
  - target: "result"
    matcher:
      allOf:
        - contains: { isError: false }
        - not:
            contains: { content: [{ text: "" }] }

The body must be a non-empty YAML sequence of matchers. An empty sequence is a load-time error.

Composition wraps only deterministic matchers, the same scope as not:. An inner snapshot, llm-judge, llm-jury, or similar is an evaluation error rather than a silent skip, because a composition that silently dropped a branch would change its truth value without the suite author knowing. Structural errors from inner matchers (a bad regex, a malformed schema) surface unchanged rather than being swallowed as a failing branch.

Compositions nest. allOf: [ anyOf: [ ... ], not: { ... } ] is legal and behaves the obvious way.

Budgets and headers (not matchers)

Four expect-level fields look like matchers in casual conversation but live alongside matcher: inside the expect: block, not inside the matcher: object itself. They apply to the whole step, not to one extracted target. To use them, write the expect: block in its long form: an object with an assertions: array plus any of the fields below.

max_duration_ms (integer). Per-step duration budget. Fails the step when wall-clock duration exceeds the budget. Cache hits skip the check.
max_response_tokens (integer or object). Caps the response token count. Long form takes budget, tokenizer, mode, and image_cost. The schema enforces these field names; see schemas/v1.json for the exact shape.
response_headers (object). For URL servers, asserts that named headers match a string, regex, schema, exists, or contains predicate.
response_headers_absent (array of strings). For URL servers, asserts that the named headers are not present.

expect:
  assertions:
    - target: "result.content[0].text"
      matcher:
        contains: "hello"
  max_duration_ms: 500
  response_headers:
    content-type: "application/json"
  response_headers_absent: ["set-cookie"]

Naming history

An earlier draft of the schema used equals, length, and jsonpath. Those names were dropped before v1.0 ships; the canonical matcher keys are listed above. Configs that still use the old names will fail schema validation with a clear "additional property not allowed" error.

Test styles

mcptest supports three test styles, written to feel familiar whether you think in matchers, in step-by-step protocol traces, or in raw JSON-RPC.

Flat (default)

The flat style is one test per array entry, one tool call per test. This is what every example above uses. It is the style 95 percent of suites should adopt.

tools:
  - name: "lists tools without error"
    server: filesystem
    tool: "list_directory"
    args:
      path: "/tmp"
    expect:
      - target: "result.content"
        matcher:
          schema:
            type: array
            minItems: 1

  - name: "reads a known fixture"
    server: filesystem
    tool: "read_file"
    args:
      path: "${fixture_path}"
    expect:
      - target: "result.content[0].text"
        matcher:
          contains: "hello"

Stepwise

Stepwise tests run multiple tool calls in order, sharing state between steps. The current schema captures stepwise behavior via separate tools: entries that share ${variables}. A dedicated steps: field is on the roadmap (tracked under a follow-up ticket).

For now, model a multi-step interaction by chaining flat tests against the same server and using a variables: entry to thread shared identifiers:

variables:
  resource_id:
    value: "res_abc123"

tools:
  - name: "step 1: creates a resource"
    server: remote_api
    tool: "create_resource"
    args:
      id: "${resource_id}"
    expect:
      - target: "result.isError"
        matcher:
          exact: false

  - name: "step 2: reads the resource back"
    server: remote_api
    tool: "read_resource"
    args:
      id: "${resource_id}"
    expect:
      - target: "result.content[0].text"
        matcher:
          contains: "${resource_id}"

  - name: "step 3: deletes the resource"
    server: remote_api
    tool: "delete_resource"
    args:
      id: "${resource_id}"
    expect:
      - target: "result.isError"
        matcher:
          exact: false

Raw JSON-RPC

Raw JSON-RPC tests sidestep the tool-call abstraction and assert against the literal protocol envelope. The v1 schema models this through compliance: checks. Use the check: field to name a built-in protocol sequence (for example, initialize or tools/list), then assert on the raw response.

compliance:
  - name: "negotiates capabilities on initialize"
    server: filesystem
    check: "initialize"
    expect:
      - target: "result.protocolVersion"
        matcher:
          regex: "^2\\d{3}-\\d{2}-\\d{2}$"
      - target: "result.capabilities"
        matcher:
          schema:
            type: object
            required: ["tools"]

  - name: "advertises required tools"
    server: filesystem
    check: "tools/list"
    expect:
      - target: "result.tools"
        matcher:
          schema:
            type: array
            minItems: 1

A wider raw-frame style (full request envelope under your control, response captured as a JSON value) is on the roadmap. The current built-in checks are documented in compliance block.

Variable interpolation

mcptest interpolates ${name} references in any string field at load time (after schema validation, before the first test runs). The available forms:

${VAR} resolves VAR from the merged variable scope (variables: block plus dotenv plus process env). If VAR is unset, the reference resolves to the empty string and mcptest prints a warning listing every unresolved name (so a typo like ${ACCOUNT_I} does not pass silently). Set MCPTEST_STRICT_VARS=1 to turn that warning into a hard error, which is the recommended setting for CI.
${VAR:-default} resolves VAR if set; otherwise inserts the literal default. The default is a string, not another reference. Use ${VAR:-} (empty default) to declare a reference that is intentionally optional and silence the unset warning.
${VAR:?} resolves VAR if set; otherwise fails the run with a clear error pointing at the field that demanded the value. Use this when a missing variable should halt the suite, for example a required API token.
${capture:NAME} resolves a value captured from an earlier test's response. When the reference is the entire value of a field and the captured value is a JSON object or array, it is injected structurally (as a nested object or array), not as a quoted string, so a captured object can be reused directly as a later tool's argument. Mixed text such as "prefix ${capture:NAME}" keeps the stringified form.
$VAR is the short form, equivalent to ${VAR}. Allowed for one-token references where the next character would not form a valid identifier.
$$ is the literal-dollar escape. Use it when you need a literal $ in the rendered string.

The full precedence table lives at mcptest.sh/docs/secrets-and-variables.

A worked example combining all forms:

variables:
  base_url:
    value: "https://api.example.com"

tools:
  - name: "uses every interpolation form"
    server: remote_api
    tool: "echo"
    args:
      url: "${base_url}/health"
      token: "${MCPTEST_API_TOKEN:?}"
      label: "${ENV_LABEL:-staging}"
      short: "$ACCOUNT_ID"
      literal: "the price is $$5"
    expect:
      - target: "result.isError"
        matcher:
          exact: false

`bearer_token_env` vs `${NAME}`

These look similar and are not the same.

servers:
  api_a:
    url: "https://a.example.com"
    auth:
      bearer_token_env: "API_A_TOKEN"

  api_b:
    url: "https://b.example.com"
    auth:
      bearer_token_env: "${SECRET_VAR_NAME}"

For api_a, mcptest reads the env var named API_A_TOKEN directly and sends its value as a bearer header. The YAML never holds the secret.

For api_b, mcptest first interpolates ${SECRET_VAR_NAME} to find the name of the env var to read, then reads that env var. This is occasionally useful (rotating tokens with versioned names), but most suites should stick to the literal bearer_token_env: NAME form. Reporters know that the value read by bearer_token_env is a credential and redact it from logs. Plain ${NAME} interpolation gets no such treatment.

See also: variables block, auth block.

`imports` block

Type: array of strings, optional, default [].

Each entry is a relative or absolute path to another mcptest YAML file. The loader merges imports in order, with later imports overriding earlier ones, and the current file overriding all of its imports. The intended shape:

imports:
  - "./shared/servers.yml"
  - "./shared/variables.yml"

tools:
  - name: "uses an imported server"
    server: shared_filesystem
    tool: "list_directory"
    args:
      path: "/tmp"

The full import implementation lands in a follow-up ticket. Today the loader recognizes the imports: array (the schema validates it) and prints a clear error if any path fails to resolve. Treat this section as a forward contract: write your suites against the documented shape, and the loader will start respecting them when the implementation lands.

Cycle detection is part of the same follow-up. Until then, do not import a file that imports your file. The loader will reject cycles with a structured error once the feature ships.

`compliance` block

Type: array of objects, optional, default [].

Compliance tests assert that the server speaks the MCP protocol correctly: capability negotiation, error shapes, required method presence. The v1 surface is a small set of built-in checks. The full compliance corpus is on the roadmap and not yet wired up.

compliance:
  - name: "negotiates capabilities on initialize"
    server: filesystem
    check: "initialize"
    expect:
      - target: "result.protocolVersion"
        matcher:
          regex: "^2\\d{3}-\\d{2}-\\d{2}$"

Fields per entry:

name (string, required, no default). Human-readable test name.
server (string, required, no default). Key into the top-level servers map.
check (string, required, no default). Compliance check identifier. v1 ships a small built-in set, with initialize and tools/list as the first two.
expect (array of assertions, optional, default []). Same shape as the tools[].expect block; see expect block.

The intended shape for the full corpus is a curated set of named checks shipped with the binary. Each check encodes a known-good interaction (for example, initialize issues an initialize request, validates the response, captures the server capabilities) and exposes assertion targets into the captured frames. The shape of those targets is the same as for tool tests, so the matchers in this doc apply unchanged.

See also: expect block, servers block.

`evals` block

Type: array of objects, optional, default [].

Evaluations grade a response against a rubric and return a score. An entry names a server and a prompt, supplies a rubric, and gates on a threshold. The rubric is the same shape the agent-side eval.rubric matcher uses, so a free-form string, a weighted criteria list, or a decision tree all work here.

A rubric takes one of three forms:

evals:
  # 1. Free-form string: a single holistic judgment.
  - name: "summary stays on topic"
    server: remote_api
    prompt: "Summarize the latest deployment."
    rubric: "Answer must mention the service name and the release tag."
    threshold: 0.7

  # 2. Weighted criteria: the score is the weight-normalized average of the
  #    per-criterion judgments, each reported with its own reason.
  - name: "booking quality"
    server: calendar
    prompt: "Book a meeting with Alice next Tuesday at 2pm and confirm it."
    rubric:
      threshold: 0.8
      criteria:
        - name: booked the right day
          description: "Created an event on the correct Tuesday."
          weight: 2
        - name: confirmed to the user
          description: "The final reply confirms the booking."

  # 3. Decision tree: one yes/no question per node, ending at a fixed score.
  - name: "weather answered"
    server: weather
    prompt: "What is the weather in Paris?"
    rubric:
      threshold: 0.7
      tree:
        ask: "Does the answer state a temperature?"
        yes: { score: 1.0, reason: "reported a temperature" }
        no:  { score: 0.0, reason: "no temperature" }

Fields per entry:

name (string, required, no default). Human-readable test name.
server (string, required, no default). Key into the top-level servers map.
prompt (string, required, no default). Prompt sent to the model.
rubric (optional). One of: a free-form string; a structured rubric object (criteria or tree, never both, mirroring the eval.rubric matcher, carrying its own threshold and strict); a reference { ref: <name> } to a rubric in the top-level rubrics: map, with optional inline threshold/strict overrides; or a preset { preset: <name> } (one of helpfulness, groundedness, safety, format-adherence, conciseness) with optional overrides and appended criteria.
response (string, optional, no default). A fixed candidate response to grade. When present, the rubric grades this text directly: key-free and reproducible, the CI-safe path. When absent, the eval's prompt runs as a tool-using agent against its server and the whole run is graded; this needs a resolved provider and a reachable server, and defers otherwise.
threshold (number in [0, 1], optional, default 0.7). Pass threshold for the free-form string form. A score at or above the threshold passes.
judge (object, optional). Per-eval judge configuration: a model override, an optional jury: { size, consensus }, or a panel: [model, ...] ensemble with aggregate (mean, median, or majority) and tie_break (pass/fail). A jury grades the rubric size times and passes on a consensus fraction; a panel grades once per model and combines the verdicts. A panel takes precedence over a jury. OSS juries and panels are single-provider. The run header prints the projected judge-call count.
matrix (object, optional). Fan the eval out across models and/or prompts: one cell per combination, each reusing the same rubric, judge, and threshold so the per-cell scores compare apples to apples. Each cell is its own report row, so the comparison renders in every reporter.

Evaluations run via mcptest eval (the dedicated subcommand). The default mcptest run invocation skips them so a basic CI gate stays cheap. The judge model is resolved from the environment; without an API key (or without a response to grade) each eval defers, reported as passed with a note, so a key-free CI run stays green. See Rubric scoring for the full guide.

`scorers` block

Type: array of objects, optional, default []. Top-level key (a sibling of servers: and tools:, not nested under evals:).

A scorer is an external grader that returns the same {verdict, score, reasoning, cost_usd} envelope as the built-in judge, so reporters, --max-cost, and verdict caching stay uniform no matter who scored. The build ships the exec scorer; vendor types (braintrust, langsmith, ...) are not included.

scorers:
  - name: factuality
    type: exec
    command: ["python3", "examples/scorers/braintrust.py"]
    env:
      BRAINTRUST_API_KEY: ${BRAINTRUST_API_KEY}
    timeout_ms: 30000

Fields per entry:

name (string, required). Folded into the verdict cache key.
type (string, required). Open-vocabulary. exec ships in the build; an unknown type parses fine and only fails at resolution with a "scorer type 'X' not available in this build" diagnostic, so the config format is the same whether or not a given scorer type is available.
command (array, required for exec). Argv; the first element is the executable. mcptest writes {input, output, criteria, metadata} as one JSON document to the process stdin and reads {verdict, score, reasoning, cost_usd} from stdout.
env (object, optional). Overlay for the subprocess. ${VAR} references resolve against the process environment at parse time.
cwd (string, optional). Working directory; inherits the parent cwd when omitted.
timeout_ms (integer, optional, default 30000). Per-call timeout.

A cost_usd: null response is accepted (external scorers often cannot compute cost) and does not count toward --max-cost. See the worked walkthrough and reference wrappers in external scorers.

`model_compatibility` block

Type: array of strings, optional, default [].

Pure metadata. Each entry is a model identifier the test suite targets. Reporters use this list to label results when a run is associated with a specific model.

model_compatibility:
  - "claude-opus-4-7"
  - "claude-sonnet-4-5"
  - "gpt-4o"

The full model-compatibility matrix (cross-model pass-fail grids, diff-against-baseline output) is the W8 milestone. v1 only records the list and surfaces it to reporters.

`performance` block

Type: object, optional, no default fields.

Time and duration units

mcptest uses two time conventions, and the field name tells you which:

Fields whose name ends in _ms (default_timeout_ms, timeout_ms, delay_ms, max_duration_ms, p95_latency_ms) take a plain integer number of milliseconds. default_timeout_ms: 30 is 30 milliseconds, not 30 seconds.
Fields named timeout, connect_timeout, and time_budget take a duration string with a unit suffix: ms, s, m, or h (for example 30s, 2m, 500ms).

When in doubt, prefer the duration-string form where the schema allows it, since the unit is explicit. A bare integer in a duration-string field is rejected at load time.

performance holds two soft budgets that apply suite-wide.

performance:
  default_timeout_ms: 10000
  p95_latency_ms: 2000

Fields:

default_timeout_ms (integer at least 1, optional, no default). Default per-test timeout in milliseconds. Individual tools[].timeout_ms overrides this value.
p95_latency_ms (integer at least 1, optional, no default). Soft p95 (95th-percentile) latency budget. Reporters highlight tests that breach the budget; the run still passes unless a test-level timeout fires.

Deeper performance work (load generation, sustained throughput, percentile-over-time output) lives behind a separate mcptest bench command and is out of scope for v1.

Profiles and baseline

Two practical knobs help large suites stay green over time. Both surfaces are stable on the CLI side; the YAML-side handles are in flight.

`profile:` keyword (in flight)

A profile: keyword on individual tests selects which test groups run for a given invocation. The three named profiles in flight are:

quick. Smoke tests only. Aim for under five seconds wall time.
standard. The default. Everything except the slow-and-expensive set.
full. Every test in the file, including evals and any slow integration paths.

The intended shape at the test-entry level:

tools:
  - name: "fast smoke"
    server: local
    tool: "ping"
    profile: "quick"
    expect:
      - target: "result.isError"
        matcher:
          exact: false

  - name: "expensive scenario"
    server: local
    tool: "stress"
    profile: "full"
    args:
      payload_kb: 1024

The CLI side already accepts --profile. The YAML schema entry for profile: is deferred (tracked under a follow-up ticket). Until it lands, filter via --include / --exclude test name patterns on the CLI.

Baselining known failures

There is no run-level expected-failures baseline yet: mcptest run does not take a --baseline flag. The baseline that ships today is the compliance baseline, a list of compliance rule IDs a server is known to fail, consumed by mcptest compliance run --baseline. A new rule failure flips the gate red; a baselined rule that starts passing is flagged so you can trim the file. See compliance-baseline.md for the file shape and the CI workflow.

Named error scenarios

A fixtures block at the top of the file declares reusable error shapes once, then any tool test can trigger one by name via inject_error:. The runner short-circuits the live dispatch path for these tests: when a tool entry sets inject_error: <name>, the executor synthesizes the {"result": {"jsonrpc":"2.0","error":{...}}} envelope from the named fixture and feeds it through the assertion pipeline as if the server had returned it. No request crosses the wire, no transport is spawned, no cassette is written. This lets a suite exercise the same failure paths (rate limits, expired tokens, 5xx blips) across many tool calls without standing up a stub server.

fixtures:
  server: mock_github
  errors:
    - name: rate_limited
      code: -32000
      message: "GitHub API rate limit exceeded"
      tool: create_issue
    - name: auth_expired
      code: -32001
      message: "OAuth token expired"
      applies_to: any

tools:
  - name: "handles rate limiting gracefully"
    server: mock_github
    tool: create_issue
    inject_error: rate_limited
    args:
      title: "Triage queue overflow"
    expect:
      - target: "result.isError"
        matcher:
          exact: true

Fields under fixtures:

server (string, optional, no default). Key into the top-level servers map identifying which server the fixtures attach to. Metadata only in v1; reporters use it for context.
errors (array of objects, optional, default []). Named error scenarios. Each entry is shaped like:
- name (string, required, no default). Unique identifier referenced by inject_error: from a tool test. Must be unique within the file.
- code (integer, required, no default). JSON-RPC error code returned in place of a successful tool result.
- message (string, required, no default). Human-readable error message echoed back to the caller.
- tool (string, one-of, no default). Scope the error to a single tool name. Mutually exclusive with applies_to.
- applies_to ("any", one-of, no default). Allow the error to be injected for any tool on the server. Mutually exclusive with tool.

Exactly one of tool or applies_to must be present. The schema rejects entries that set both or neither.

On the test side, inject_error: on a tool entry names a fixture:

inject_error (string, optional, no default). The name of an entry under fixtures.errors[]. The schema does not enforce that the name resolves to a declared fixture; that cross-reference check is the loader's job and produces the user-friendly error message when a name is misspelled.

The synthesized envelope matches the shape a real MCP server returns for a JSON-RPC error: a top-level result wrapper (so dotted assertion paths rooted at result. resolve), with jsonrpc: "2.0" and an error object that carries code and message. Authoring an assertion against an injected fixture works exactly like asserting against a real failure:

expect:
  - target: "result.error.code"
    matcher:
      exact: -32000
  - target: "result.error.message"
    matcher:
      icontains: "rate limit"

Worked examples ship at examples/named-errors-stdio.yml and examples/named-errors-url.yml.

Compositions

A compositions: block declares a directed acyclic graph (DAG) of tool calls that runs in topological order. Each node names its parents in needs:, and a ${id.field} reference whose id is not in needs: is a load-time error. See compositions.md for the full reference and worked example.

compositions:
  - name: render-top-readme
    nodes:
      - id: search
        tool: search_packages
        args: { query: "mcp" }
      - id: readme
        needs: [search]
        tool: get_readme
        args:
          package: "${search.top_hit.name}"
    output: readme
    expect:
      - target: "nodes.readme.status"
        matcher:
          exact: "ok"

Assertion targets are addressable: composition.ran, composition.order, nodes.<id>.status, nodes.<id>.output, nodes.<id>.duration_ms, and output.* (the combined output from the output: node, or the last node in topological order when output: is omitted).

Cassette references

Cassettes record live MCP traffic to disk so a test can replay it deterministically later. The v1 cassette implementation is in flight in the mcptest-cassette crate. The YAML surface (a cassette: field on a tool test that points to a captured frame set) is documented in a separate reference once the implementation stabilizes.

Until cassettes ship, run against the live server. The schema does not currently validate a cassette: key; do not add one yet, the loader will reject it as unknown.

Putting it all together

A larger example exercising most of the v1 surface:

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json

servers:
  filesystem:
    command: ["npx", "-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
    env:
      LOG_LEVEL: "info"

  remote_api:
    url: "https://mcp.example.com/v1"
    auth:
      bearer_token_env: "MCPTEST_REMOTE_API_TOKEN"
    wait_for_ready: "https://mcp.example.com/healthz"

variables:
  fixture_path:
    value: "/tmp/mcptest-fixture.txt"
  account_id:
    from_env: "MCPTEST_ACCOUNT_ID"
    default: "acct_demo"

tools:
  - name: "lists tools without error"
    server: filesystem
    tool: "list_directory"
    args:
      path: "/tmp"
    expect:
      - target: "result.content"
        matcher:
          schema:
            type: array
            minItems: 1
        message: "directory listing should not be empty"

  - name: "reads a known fixture"
    server: filesystem
    tool: "read_file"
    args:
      path: "${fixture_path}"
    expect:
      - target: "result.content[0].text"
        matcher:
          contains: "hello"

  - name: "echoes the input back"
    server: remote_api
    tool: "echo"
    args:
      message: "ping from ${account_id}"
    expect:
      - target: "result.content[0].text"
        matcher:
          regex: "^ping from acct_"

compliance:
  - name: "negotiates capabilities on initialize"
    server: filesystem
    check: "initialize"

  - name: "advertises required tools"
    server: remote_api
    check: "tools/list"
    expect:
      - target: "result.tools"
        matcher:
          schema:
            type: array
            minItems: 1

evals:
  - name: "summary stays on topic"
    server: remote_api
    prompt: "Summarize the latest deployment."
    rubric: "Answer must mention the service name and the release tag."
    threshold: 0.7

model_compatibility:
  - "claude-opus-4-7"
  - "claude-sonnet-4-5"

performance:
  default_timeout_ms: 15000
  p95_latency_ms: 2000

This file:

Defines one stdio server (filesystem) and one URL server (remote_api) with bearer-token auth and a readiness probe.
Declares two variables, one literal and one env-backed with a default.
Runs three tool tests covering the schema, contains, and regex matchers, with a mix of literal and interpolated arguments.
Runs two compliance checks, one with no extra assertions and one with a schema assertion that requires at least one tool entry.
Declares one evaluation (skipped by mcptest run, executed by mcptest eval).
Labels the suite with two model identifiers.
Sets a 15-second default timeout and a 2-second p95 latency budget.

Run it with:

mcptest validate path/to/file.yml
mcptest run path/to/file.yml

validate runs schema validation only. run validates first, then executes every test in the file.

Field index

A quick cross-reference, alphabetical by full path.

Path	Type	Required	Default	Section
`compliance[].check`	string	yes	`-`	`compliance`
`compliance[].expect[]`	array	no	`[]`	`expect`
`compliance[].name`	string	yes	`-`	`compliance`
`compliance[].server`	string	yes	`-`	`compliance`
`evals[].name`	string	yes	`-`	`evals`
`evals[].prompt`	string	yes	`-`	`evals`
`evals[].rubric`	string	no	`-`	`evals`
`evals[].server`	string	yes	`-`	`evals`
`evals[].threshold`	number	no	`0.7`	`evals`
`fixtures.server`	string	no	`-`	Named error scenarios
`fixtures.errors[].name`	string	yes	`-`	Named error scenarios
`fixtures.errors[].code`	integer	yes	`-`	Named error scenarios
`fixtures.errors[].message`	string	yes	`-`	Named error scenarios
`fixtures.errors[].tool`	string	yes (one of)	`-`	Named error scenarios
`fixtures.errors[].applies_to`	`"any"`	yes (one of)	`-`	Named error scenarios
`imports[]`	string	no	`[]`	`imports`
`model_compatibility[]`	string	no	`[]`	`model_compatibility`
`performance.default_timeout_ms`	integer	no	`-`	`performance`
`performance.p95_latency_ms`	integer	no	`-`	`performance`
`servers.<name>.command`	array	yes (stdio)	`-`	`servers`
`servers.<name>.env`	object	no	`{}`	`servers`
`servers.<name>.url`	string	yes (url)	`-`	`servers`
`servers.<name>.auth.bearer_token_env`	string	yes (bearer)	`-`	`auth`
`servers.<name>.auth.oauth.client_id_env`	string	yes (oauth)	`-`	`auth`
`servers.<name>.auth.oauth.authorization_url`	string	yes (oauth)	`-`	`auth`
`servers.<name>.auth.oauth.token_url`	string	yes (oauth)	`-`	`auth`
`servers.<name>.auth.oauth.scopes`	array	no	`[]`	`auth`
`servers.<name>.wait_for_ready`	string	no	`-`	`servers`
`tools[].args`	object	no	`{}`	`tools`
`tools[].expect[]`	array	no	`[]`	`expect`
`tools[].expect[].matcher.contains`	any	yes (one of)	`-`	`contains`
`tools[].expect[].matcher.exact`	any	yes (one of)	`-`	`exact`
`tools[].expect[].matcher.llm-judge`	object	yes (one of)	`-`	`llm-judge`
`tools[].expect[].matcher.regex`	string	yes (one of)	`-`	`regex`
`tools[].expect[].matcher.schema`	object	yes (one of)	`-`	`schema`
`tools[].expect[].matcher.snapshot`	string or object	yes (one of)	`-`	`snapshot`
`tools[].expect[].message`	string	no	`-`	`expect`
`tools[].expect[].target`	string	yes	`-`	`expect`
`tools[].inject_error`	string	no	`-`	Named error scenarios
`tools[].name`	string	yes	`-`	`tools`
`tools[].server`	string	yes	`-`	`tools`
`tools[].timeout_ms`	integer	no	`-`	`tools`
`tools[].tool`	string	yes	`-`	`tools`
`variables.<name>.default`	string	no	`-`	`variables`
`variables.<name>.from_env`	string	no (one of)	`-`	`variables`
`variables.<name>.value`	scalar	no (one of)	`-`	`variables`

A field marked "yes (one of)" is required only when its sibling siblings are absent, per the oneOf constraints in the schema.

Deferred and in-flight summary

For quick reference, every field this page documents that is not yet validated by schemas/v1.json:

Field or feature	Status
`servers.<name>.auth.headers`	deferred to a future release
`servers.<name>.auth.mtls`	deferred to a future release
`servers.<name>.headers`	schema accepts, runner wires up
`servers.<name>.http`	schema accepts, runner wires up
`servers.<name>.http.mtls`	deferred to a future release
`tools[].data` (parametric inputs)	in flight
`tools[].expect[].max_duration_ms`	schema accepts, runner wires up
`tools[].expect[].max_response_tokens`	schema accepts, runner wires up
`tools[].expect[].response_headers`	schema accepts, runner wires up
`tools[].expect[].response_headers_absent`	schema accepts, runner wires up
`tools[].profile`	in flight
`tools[].cassette`	in flight (mcptest-cassette crate)
`fixtures.errors[]` model	shipped
`tools[].inject_error` injection	shipped
Compliance corpus expansion	in flight
Container-managed servers	a future release
Full eval grader	in flight
Baseline expected-failures file	in flight

If you find a field this page does not cover that the schema does validate, that is a doc bug; file an issue against the mcptest project.

YAML test format reference (v1)

Overview

Top-level keys

target_versions block

servers block

Subprocess (stdio) servers

CLI server-target overrides

URL (HTTP or SSE) servers

Cassette (replay) servers

auth block

Bearer token from env

OAuth 2.0 client credentials

Deferred auth fields

Custom headers and HTTP transport

server.headers

server.http

CLI flags for headers and HTTP transport

Worked examples

variables block

tools block

tags (optional, on every test type)

transform (optional, on tool tests)

resources block

prompts block

agents block

Multi-turn conversations

eval (judged metrics)

faults block

providers block

budget block

expect block (the matchers)

exact matcher

contains matcher

regex matcher

schema matcher

snapshot matcher

llm-judge matcher

llm-jury matcher

contains-all matcher

contains-any matcher

icontains matcher

starts-with matcher

is-json matcher

is-valid-tools-call matcher

levenshtein matcher

is-xml matcher

is-sql matcher

similar matcher

cel matcher

factuality, answer-relevance, context-faithfulness matchers

not matcher

oneOf / anyOf / allOf composition matchers

Budgets and headers (not matchers)

Naming history

Test styles

Flat (default)

Stepwise

Raw JSON-RPC

Variable interpolation

bearer_token_env vs ${NAME}

imports block

compliance block

evals block

scorers block

model_compatibility block

performance block

Time and duration units

Profiles and baseline

profile: keyword (in flight)

Baselining known failures

Named error scenarios

Compositions

Cassette references

Putting it all together

Field index

Deferred and in-flight summary

See also

`target_versions` block

`servers` block

`auth` block

`server.headers`

`server.http`

`variables` block

`tools` block

`tags` (optional, on every test type)

`transform` (optional, on tool tests)

`resources` block

`prompts` block

`agents` block

`eval` (judged metrics)

`faults` block

`providers` block

`budget` block

`expect` block (the matchers)

`exact` matcher

`contains` matcher

`regex` matcher

`schema` matcher

`snapshot` matcher

`llm-judge` matcher

`llm-jury` matcher

`contains-all` matcher

`contains-any` matcher

`icontains` matcher

`starts-with` matcher

`is-json` matcher

`is-valid-tools-call` matcher

`levenshtein` matcher

`is-xml` matcher

`is-sql` matcher

`similar` matcher

`cel` matcher

`factuality`, `answer-relevance`, `context-faithfulness` matchers

`not` matcher

`oneOf` / `anyOf` / `allOf` composition matchers

`bearer_token_env` vs `${NAME}`

`imports` block

`compliance` block

`evals` block

`scorers` block

`model_compatibility` block

`performance` block

`profile:` keyword (in flight)