YAML test format reference (v1)
A complete walk-through of every field mcptest reads from a test YAML file. The JSON Schema at schemas/v1.json is the source of truth. This page documents the v1 surface the schema actually validates today and flags the fields that are in flight or deferred so you can plan around them without guessing.
Overview
An mcptest configuration is a single YAML file (or a small tree of files joined via imports) that describes:
- Which MCP servers to talk to.
- Which tool calls, compliance checks, and evaluations to run.
- How to judge the responses.
Every file should start with the YAML language server directive so editors that understand JSON Schema can autocomplete fields, surface inline help, and flag typos before you run the test:
# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json
The CLI runs the same schema at load time. mcptest validate <file> loads the YAML, applies schemas/v1.json via the jsonschema crate, and prints errors with JSON pointers into the offending fields. The loader rejects unknown keys at the top level and inside every nested object that uses additionalProperties: false in the schema, so a typo in varables: fails the run instead of silently doing nothing.
A minimal file looks like this:
# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json
servers:
  local:
    command: ["./target/debug/my-mcp-server"]
tools:
  - name: "lists tools without error"
    server: local
    tool: "list_directory"
    args:
      path: "/tmp"
    expect:
      - target: "result.content"
        matcher:
          schema:
            type: array
            minItems: 1
The rest of this page walks each block top down. Every field entry lists type, whether it is required, the default, a short snippet, and links to related sections.
Top-level keys
| Key | Required | Purpose |
|---|---|---|
| servers | yes | Named MCP servers under test. See servers block. |
| imports | no | Other YAML files to merge in. See imports block. |
| variables | no | Author-defined variables for ${name} interpolation. See variables block. |
| tools | no | Array of tool-call tests. See tools block. |
| resources | no | Array of resource-read tests. See resources block. |
| prompts | no | Array of prompt-get tests. See prompts block. |
| agents | no | Agent end-to-end tests against one or more models. See agents block. |
| faults | no | Named fault injections an agent test references via inject:. See faults block. |
| providers | no | Custom OpenAI-compatible provider definitions. See providers block. |
| budget | no | Per-test and per-suite spend caps for agent runs. See budget block. |
| compliance | no | Protocol-level checks against the server. See compliance block. |
| evals | no | Rubric or model-graded evaluations. See evals block. |
| rubrics | no | Reusable named rubrics referenced from evals via rubric: { ref: <name> }. See evals block. |
| model_compatibility | no | Metadata list of model identifiers the suite targets. See model_compatibility block. |
| performance | no | Per-test and suite-wide latency budgets. See performance block. |
| target_versions | no | Run the suite against more than one MCP protocol revision. See target_versions block. |
Anything else at the top level fails validation. If you need a custom field for a future feature, file an issue before adding it.
target_versions block
Type: array of protocol-revision strings. Optional.
A suite that lists target_versions: declares that it wants to run against every listed MCP revision in turn, not just the runner's default. A server in a deprecation window can be validated against both the old and the new contracts in a single command, and the reporter aggregates per-version pass/fail.
target_versions:
  - "2025-11-25"
  - "2026-07-28"
servers:
  fs:
    command: ["./fs-server"]
The accepted values are the wire strings listed in docs/stateless-transport.md: 2024-11-05, 2025-03-26, 2025-06-18, 2025-11-25, 2026-03-26, 2026-07-28. An unknown value is a load-time error so a typo never silently selects a session-based fallback. Duplicates are de-duped while preserving declaration order so the reporter prints per-version results in the order the suite listed them.
CLI override. mcptest run --target-version <V> selects a single entry from the suite's list, so a CI matrix can keep one job per protocol revision without sharding the suite. The override must parse via ProtocolVersion::from_wire and, when the suite declares a list, must be one of its entries; mismatched overrides fail with a pointer at the declared list. When the suite declares no list, the override still works as a one-off "run against this revision only" knob.
Selector library. The pure mcptest_core::target_versions::effective(suite_versions, cli_override) function returns the list the runner walks. The runner integration that calls it (per-version fan-out + per-version report aggregation) is the next focused commit on this work; the selector ships today so the CLI flag and the integration land in isolation.
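The selection rules described above can be sketched in a few lines. This is a minimal Python illustration, not the Rust mcptest_core::target_versions::effective function itself: the function signature, the error type, and the choice to return an empty list when neither a suite list nor an override is given (meaning "use the runner's default") are assumptions made for the example.

```python
# Accepted wire strings, copied from the list in this section.
KNOWN_VERSIONS = (
    "2024-11-05", "2025-03-26", "2025-06-18",
    "2025-11-25", "2026-03-26", "2026-07-28",
)


def effective(suite_versions, cli_override=None):
    """Return the ordered list of revisions a run would walk (illustrative sketch)."""
    declared = []
    for version in suite_versions:
        if version not in KNOWN_VERSIONS:
            # Unknown values are a load-time error, never a silent fallback.
            raise ValueError(f"unknown protocol revision: {version!r}")
        if version not in declared:
            # Drop duplicates while preserving declaration order.
            declared.append(version)

    if cli_override is None:
        # Assumption: an empty result means "use the runner's default revision".
        return declared

    if cli_override not in KNOWN_VERSIONS:
        raise ValueError(f"unknown protocol revision: {cli_override!r}")
    if declared and cli_override not in declared:
        # A mismatched override points the user at the declared list.
        raise ValueError(f"{cli_override!r} is not one of the declared versions {declared}")
    return [cli_override]
```

With the suite above, effective(["2025-11-25", "2026-07-28"], "2026-07-28") yields a single-entry list, which is what lets a CI matrix keep one job per revision.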
servers block
Type: object, required, at least one entry.
Each key under servers is the name a test block refers to via its server: field. The value is one of three shapes: a subprocess specification (command:), a URL specification (url:), or a cassette specification (cassette:) that replays a recording instead of reaching a live server. Exactly one of the three must be present; the schema enforces this with oneOf.
Subprocess (stdio) servers
Use a subprocess server when the MCP server is a local binary, an npx package, or anything that speaks MCP over stdio.
servers:
  filesystem:
    command: ["npx", "-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
    env:
      LOG_LEVEL: "info"
Fields:
- command (array of strings, required, no default). The argv to spawn. The first element is the executable; the remaining elements are arguments. Must contain at least one entry.
- env (object of string to string, optional, default {}). Extra environment variables passed to the child process. Values may reference parent env via ${VAR} interpolation (see Variable interpolation).
Snippet for a Rust binary built in the workspace:
servers:
  local:
    command: ["./target/debug/my-mcp-server", "--config", "./fixtures/test.toml"]
    env:
      RUST_LOG: "debug"
See also: Variable interpolation, tools block.
CLI server-target overrides
Four global CLI flags rewrite the server: block at runtime so the same YAML can target a freshly-deployed preview or CI environment whose URL or command is not knowable at authoring time. The flags live on the global parser, so they go before the subcommand: mcptest --server-url https://preview run.
| Flag | Effect |
|---|---|
| --server-url <URL> | Replace server.url. If the YAML had a command:, it is removed. Mutually exclusive with --server-command. |
| --server-command <CMD> | Replace server.command. <CMD> is split with POSIX shell rules (shell-words), so --server-command "./dev --debug" parses to ["./dev", "--debug"]. Mutually exclusive with --server-url. |
| --server-auth-bearer-env <NAME> | Set server.auth.bearer_token_env for URL targets. |
| --server-config <PATH> | Load a YAML file containing a full server: block and use it in place of the in-suite block. |
Precedence (lowest to highest):
- The YAML file's own server: block.
- --server-config (full-block replacement).
- --server-url, --server-command, --server-auth-bearer-env (single-field overrides). A single-field flag wins over the same field in --server-config.
When any override flag is set, the runner prints a one-line banner naming the flags before it runs so the operator can spot a misconfigured CI job without re-reading the YAML.
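The precedence rules above can be read as a small merge function. The sketch below is a Python illustration under stated assumptions, not mcptest's implementation: the function name and its keyword arguments are hypothetical, Python's shlex.split stands in for the shell-words crate, removing url: when --server-command is given is inferred from the "exactly one shape" rule, and replacing the whole auth object when --server-auth-bearer-env is given is a simplification.

```python
import shlex


def apply_server_overrides(yaml_server, config_server=None, url=None,
                           command=None, auth_bearer_env=None):
    """Merge CLI overrides into a server block, lowest precedence first (sketch)."""
    if url is not None and command is not None:
        raise ValueError("--server-url and --server-command are mutually exclusive")

    # 1. Start from the YAML block; 2. --server-config replaces it wholesale.
    server = dict(config_server if config_server is not None else yaml_server)

    # 3. Single-field flags win over both of the above.
    if url is not None:
        server.pop("command", None)  # a command: in the YAML is removed
        server["url"] = url
    if command is not None:
        server.pop("url", None)      # assumption: keeps exactly one shape present
        server["command"] = shlex.split(command)  # POSIX shell word splitting
    if auth_bearer_env is not None:
        # Simplification: replace the auth object rather than merging into it.
        server["auth"] = {"bearer_token_env": auth_bearer_env}
    return server
```

For example, apply_server_overrides({"command": ["./srv"]}, url="https://preview") returns a block that contains only the url: key.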
URL (HTTP or SSE) servers
Use a URL server when the MCP server is already running, lives in a container, or is reachable over the network.
servers:
  remote_api:
    url: "https://mcp.example.com/v1"
    auth:
      bearer_token_env: "MCPTEST_REMOTE_API_TOKEN"
    wait_for_ready: "https://mcp.example.com/healthz"
Fields:
- url (string, required, no default). Full URL to the MCP endpoint. Must parse as a URI. Scheme is http(s) for HTTP transport or http(s) with SSE content negotiation.
- auth (object, optional, no default). How to authenticate. See auth block.
- wait_for_ready (string, optional, no default). URL or path the runner polls until it returns a 2xx response before sending MCP requests. Use this to coordinate with slow-starting containers.

The headers and mtls keys are reserved on the auth object for a future release. They are not validated against any shape in v1, and the loader prints a warning if it sees them. See the deferred notes in auth block.
Container-managed servers are only briefly covered here. Full container support (lifecycle, networking, healthchecks) is planned for a future release. Until then, run your container yourself and point a URL server at it.
See also: auth block, Variable interpolation.
Cassette (replay) servers
Use a cassette server to replay a recorded MCP server instead of reaching a live one. The run never touches the network: every request is matched against the recorded exchanges and the recorded response is returned. This is the offline, deterministic, key-free path for CI.
servers:
  recorded:
    cassette: ./cassettes/issues-server.json
Fields:
- cassette (string, required). Path to the cassette JSON file, resolved relative to the working directory. ${VAR} interpolation is supported.
cassette: is mutually exclusive with command: and url:; setting more than one is rejected at load time. See Cassettes for the file format and how matching works.
auth block
Authentication strategy for URL servers. Exactly one of bearer_token_env or oauth must be present in v1.
Bearer token from env
servers:
  remote_api:
    url: "https://mcp.example.com/v1"
    auth:
      bearer_token_env: "MCPTEST_REMOTE_API_TOKEN"
Fields:
- bearer_token_env (string, required, no default). Name of the environment variable that holds the bearer token. mcptest sends Authorization: Bearer ${value} where value is read from that env var at request time. The token never appears in the YAML file.
Note the distinction between bearer_token_env: NAME and ${NAME}. The former tells mcptest "read this env var into the Authorization header." The latter is generic string interpolation usable in any string field. Use bearer_token_env for auth credentials; the loader applies a redaction pass to those values in reporter output. Use ${NAME} for everything else. This is covered again in Variable interpolation.
OAuth 2.0 client credentials
servers:
  remote_api:
    url: "https://mcp.example.com/v1"
    auth:
      oauth:
        client_id_env: "MCPTEST_OAUTH_CLIENT_ID"
        authorization_url: "https://auth.example.com/oauth/authorize"
        token_url: "https://auth.example.com/oauth/token"
        scopes:
          - "mcp:read"
          - "mcp:invoke"
Fields under oauth:
- client_id_env (string, required, no default). Env var holding the OAuth client id. The matching client secret is read from ${client_id_env}_SECRET by convention (so a client_id_env of MCPTEST_OAUTH_CLIENT_ID resolves the secret from MCPTEST_OAUTH_CLIENT_ID_SECRET).
- authorization_url (string, required, no default). Authorization endpoint per RFC 6749 section 3.1.
- token_url (string, required, no default). Token endpoint per RFC 6749 section 3.2.
- scopes (array of non-empty strings, optional, default []). Scopes to request. Omit when the authorization server has a default scope.
Deferred auth fields
- headers on auth (object of string to string, deferred to a future release). Will accept a map of header names to env-backed values. The schema is permissive so configs that opt in early do not break loading. The loader logs a warning today. Custom headers for non-credential use cases live on the URL server block as server.headers; see Custom headers and HTTP transport.
- mtls on auth (object, deferred to a future release). Will accept paths to client cert and key plus an optional CA bundle. Same permissive shape, same warning. Server-side TLS knobs (insecure_skip_verify, custom CA bundle, minimum protocol version) ship today under server.http.tls.
See also: Variable interpolation, tools block.
Custom headers and HTTP transport
URL servers accept two optional blocks that configure the HTTP transport without touching auth:. The server.headers block sends custom headers on every request, and the server.http block tunes timeouts, redirect handling, and basic TLS knobs. mTLS (client certificates) is deferred to a future release; the model carries verify on/off, CA bundle, and minimum protocol version only.
server.headers
Type: object, optional, default {}.
Each entry maps a header name to a value. The value is either a literal string (with ${VAR} interpolation against variables) or an object {env: NAME} that the runner reads from the environment at connect time. Pick the env form for secret-flavored headers so they never land in the YAML file.
servers:
  saas_api:
    url: "https://api.example.com/mcp"
    headers:
      X-Tenant: "acme"
      X-API-Key:
        env: ACME_TENANT_API_KEY
      X-Trace-Id: "${request_id}"
Fields per entry:
- Literal string (string, no nested fields). Sent verbatim on the wire after ${VAR} interpolation runs.
- Object {env: NAME} (one key env, required). Reads NAME from the environment at connect time. Fails the run with a clear error if NAME is unset.
Authorization and Proxy-Authorization are rejected here. Use the auth: block for bearer tokens and OAuth; see auth block. Header names must consist of valid RFC 7230 token characters (letters, digits, and the punctuation set ! # $ % & ' * + - . ^ _ ` | ~). Names with spaces or colons fail validation.
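As an illustration of the header-name rules, here is a small Python check. It is a sketch rather than the loader's actual validation: the function name is hypothetical, and comparing the reserved names case-insensitively is an assumption based on HTTP header names being case-insensitive.

```python
import re

# RFC 7230 "tchar": letters, digits, and ! # $ % & ' * + - . ^ _ ` | ~
_TOKEN = re.compile(r"[A-Za-z0-9!#$%&'*+\-.^_`|~]+")

# Credential headers belong in the auth: block, not in server.headers.
_RESERVED = {"authorization", "proxy-authorization"}


def check_header_name(name):
    """Return the name unchanged if it is acceptable, else raise ValueError."""
    if _TOKEN.fullmatch(name) is None:
        raise ValueError(f"invalid header name: {name!r}")
    if name.lower() in _RESERVED:  # assumption: case-insensitive comparison
        raise ValueError(f"{name!r} must be configured through the auth: block")
    return name
```

Under this sketch X-Tenant and x-api-key pass, while "X Tenant", "X:Tenant", and Authorization are rejected.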
server.http
Type: object, optional, default {}.
Controls HTTP transport behavior for URL servers. Stdio servers ignore the block. Duration fields take a number with a suffix: 30s, 500ms, 1m, 1h. Bare numbers without a unit are rejected so a future reader can tell what the value means.
servers:
  saas_api:
    url: "https://api.example.com/mcp"
    http:
      timeout: 30s
      connect_timeout: 5s
      max_redirects: 5
      user_agent_override: true
      tls:
        insecure_skip_verify: false
        ca_bundle_path: "/etc/ssl/internal-ca.pem"
        min_version: "1.2"
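The duration format used by timeout and connect_timeout can be sketched as a tiny parser. This Python example is illustrative only: the function name is hypothetical, and accepting only whole numbers (no fractions) is an assumption, since the section lists the suffixes but not the full numeric grammar.

```python
import re

# Suffixes named in this section, mapped to milliseconds.
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000}
_DURATION = re.compile(r"(\d+)(ms|s|m|h)")


def parse_duration_ms(text):
    """Parse strings such as 30s, 500ms, 1m, or 1h into milliseconds."""
    match = _DURATION.fullmatch(text)
    if match is None:
        # Bare numbers (no unit) and unknown suffixes are rejected.
        raise ValueError(f"invalid duration: {text!r}")
    return int(match.group(1)) * _UNIT_MS[match.group(2)]
```

Rejecting a bare 30 outright, instead of guessing seconds or milliseconds, mirrors the rule stated above that every duration carries its unit.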
Fields:
- timeout (duration string, optional, default 30s). Per-request timeout. The runner aborts the request when the body has not fully arrived in this window.
- connect_timeout (duration string, optional, default 5s). TCP and TLS connect timeout. Fires before timeout does on a stuck connection.
- max_redirects (integer 0 to 50, optional, default 5). Maximum number of redirects the runner follows on a single request. Set to 0 to disable redirects.
- user_agent_override (boolean, optional, default true). When true, the runner sets a User-Agent header identifying mcptest. Disable when an upstream gateway gates on a specific user-agent string.
- tls (object, optional, default {}). Basic TLS knobs, documented next.
Fields under tls:
- insecure_skip_verify (boolean, optional, default false). Skip certificate verification. Dangerous; only use against a private staging endpoint with a self-signed certificate. The CLI flag --insecure-skip-verify sets this for the whole run and prints a WARNING banner on stderr.
- ca_bundle_path (string, optional, no default). Path to a PEM-encoded CA bundle. Use when an internal certificate authority issues your endpoint's certificate.
- min_version ("1.2" or "1.3", optional, default "1.2"). Minimum TLS protocol version. The runner refuses to negotiate below this floor.
CLI flags for headers and HTTP transport
Six global CLI flags map to the same surface, so the same YAML can target a preview environment without editing the file. The --header and --header-env flags are repeatable.
| Flag | Effect |
|---|---|
| --header NAME=VALUE | Append a literal header. Repeatable. Rejects Authorization and Proxy-Authorization. |
| --header-env NAME=VAR | Append an env-backed header. Repeatable. Reads VAR from the environment at connect time. |
| --insecure-skip-verify | Disable TLS verification (loud WARNING banner). |
| --ca-bundle PATH | Path to a PEM-encoded CA bundle. |
| --http-timeout SECONDS | Override server.http.timeout. |
| --connect-timeout SECONDS | Override server.http.connect_timeout. |
Precedence (lowest to highest):
- server.headers and server.http from the YAML file.
- --header, --header-env, --ca-bundle, --http-timeout, --connect-timeout, --insecure-skip-verify. CLI flags overwrite YAML values per header name (case-insensitive).
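The per-header, case-insensitive override can be pictured with a short merge helper. This is a Python sketch with a hypothetical function name; keeping the spelling of the winning (CLI) header name is an assumption.

```python
def merge_headers(yaml_headers, cli_headers):
    """Overlay CLI headers on YAML headers, matching names case-insensitively."""
    merged = {}
    for source in (yaml_headers, cli_headers):  # lowest precedence first
        for name, value in source.items():
            # Remove any earlier entry whose name differs only by case.
            for existing in [key for key in merged if key.lower() == name.lower()]:
                del merged[existing]
            merged[name] = value
    return merged
```

For instance, a YAML X-Tenant: acme combined with --header x-tenant=beta leaves a single tenant header whose value is beta.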
Worked examples
Cloudflare Access:
servers:
  protected:
    url: "https://mcp.internal.example.com/v1"
    headers:
      CF-Access-Client-Id:
        env: CF_ACCESS_CLIENT_ID
      CF-Access-Client-Secret:
        env: CF_ACCESS_CLIENT_SECRET
Google Cloud IAP:
servers:
  gcp:
    url: "https://iap.example.com/mcp"
    headers:
      Proxy-Authorization-IAP:
        env: GCP_IAP_TOKEN
    http:
      timeout: 60s
AWS API Gateway with an API key:
servers:
  awsg:
    url: "https://abc123.execute-api.us-east-1.amazonaws.com/prod/mcp"
    headers:
      x-api-key:
        env: AWSG_API_KEY
Multi-tenant SaaS with a tenant header:
servers:
  saas:
    url: "https://api.example.com/mcp"
    headers:
      X-Tenant: "acme"
      X-Plan: "enterprise"
Distributed tracing:
servers:
  traced:
    url: "https://api.example.com/mcp"
    headers:
      X-Trace-Id: "${run_id}"
      traceparent: "${otel_traceparent}"
Internal CA bundle (corporate intranet endpoint):
servers:
  internal:
    url: "https://mcp.corp.example.com/v1"
    http:
      tls:
        ca_bundle_path: "/etc/ssl/corp-internal-ca.pem"
        min_version: "1.3"
mTLS (client certificates) is deferred to a future release; this section will gain an mtls: block alongside tls: when that work lands.
variables block
Type: object, optional, default {}.
variables holds author-defined values usable inside any string field via ${name} interpolation. Each entry is either a literal value: or a from_env: reference, never both.
variables:
  fixture_path:
    value: "/tmp/mcptest-fixture.txt"
  account_id:
    from_env: "MCPTEST_ACCOUNT_ID"
    default: "acct_demo"
Fields per entry:
- value (string, number, or boolean, optional, no default). Literal value. Use this for paths, identifiers, and other static data.
- from_env (string, optional, no default). Name of an environment variable to read. Mutually exclusive with value.
- default (string, optional, no default). Fallback value when from_env is unset. Only meaningful alongside from_env.
Resolution precedence is documented at mcptest.sh/docs/secrets-and-variables. The short version, in order from highest to lowest precedence:
- Process environment variables (export VAR=... or whatever your shell or CI runner set up).
- Values written to a loaded dotenv file (.env next to the test file, or a file passed via --env-file).
- The default: field on a from_env variable.
- The value: field on a literal variable.
Variables resolve once when the configuration loads. References that fail to resolve (env var unset, no default) raise a structured error before any test runs.
variables:
  api_base:
    value: "https://staging.example.com"
  auth_token:
    from_env: "MCPTEST_TOKEN"
    default: "dev-only-token"
tools:
  - name: "fetches a resource"
    server: remote_api
    tool: "fetch"
    args:
      url: "${api_base}/resources/42"
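The resolution order for a single variable entry can be sketched as follows. This Python example is an illustration of the documented precedence, not the loader's code: the function name is hypothetical, and process_env and dotenv are plain dictionaries standing in for the process environment and a loaded dotenv file.

```python
def resolve_variable(name, entry, process_env, dotenv):
    """Resolve one variables: entry using the documented precedence (sketch)."""
    if "value" in entry and "from_env" in entry:
        raise ValueError(f"variable {name!r}: value and from_env are mutually exclusive")

    if "from_env" in entry:
        env_name = entry["from_env"]
        if env_name in process_env:   # 1. process environment wins
            return process_env[env_name]
        if env_name in dotenv:        # 2. then a loaded dotenv file
            return dotenv[env_name]
        if "default" in entry:        # 3. then the default: fallback
            return entry["default"]
        # Unresolved references fail before any test runs.
        raise ValueError(f"variable {name!r}: {env_name} is unset and has no default")

    if "value" in entry:              # 4. literal value
        return entry["value"]
    raise ValueError(f"variable {name!r}: needs either value or from_env")
```

With the example above and MCPTEST_TOKEN unset everywhere, auth_token resolves to dev-only-token; exporting MCPTEST_TOKEN in the shell takes priority over both the dotenv file and the default.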
See also: Variable interpolation, servers block.
tools block
Type: array of objects, optional, default [].
Each entry is a tool-call test: invoke a single tool on a named server and run assertions against the result.
tools:
  - name: "lists tools without error"
    server: filesystem
    tool: "list_directory"
    args:
      path: "/tmp"
    expect:
      - target: "result.content"
        matcher:
          schema:
            type: array
            minItems: 1
    timeout_ms: 5000
Fields per entry:
- name (string, required, no default). Human-readable test name. Appears in reporter output and JUnit summaries.
- server (string, required, no default). Key into the top-level servers map. The test fails to load if the name is not registered.
- tool (string, required, no default). Tool name the server exposes. The runner checks the name against tools/list when available.
- args (object, optional, default {}). JSON-serializable arguments passed to the tool. Values may use ${name} interpolation against variables.
- expect (array of assertions, optional, default []). Assertions evaluated against the tool result. See expect block.
- timeout_ms (integer, optional, no default). Per-test timeout override in milliseconds. Falls back to performance.default_timeout_ms when unset. See performance block.
Parametric inputs (a data: array that runs the same test once per row) are deferred. The intended shape is one entry per row, with each row substituted into the args block via ${data.field} interpolation. Until that lands, duplicate the test or use a small shell loop.
A worked example using both literal and env-backed variables:
variables:
  greeting:
    value: "hello"
  who:
    from_env: "MCPTEST_GREET_WHO"
    default: "world"
tools:
  - name: "echoes a greeting"
    server: local
    tool: "echo"
    args:
      message: "${greeting}, ${who}"
    expect:
      - target: "result.content[0].text"
        matcher:
          exact: "hello, world"
        message: "echo should round-trip the rendered message"
See also: expect block, variables block, Test styles.
tags (optional, on every test type)
Type: array of strings. Optional. Available on tool tests, resource tests, prompt tests, and agent tests.
tools:
  - name: search returns at least one result
    server: prod-api
    tool: search
    args:
      query: "weather"
    tags: ["smoke", "search"]
    expect:
      - target: "result.content"
        matcher:
          contains: "weather"
mcptest run --tag smoke keeps tests whose tag list contains smoke. mcptest run --skip-tag slow drops tests whose tag list contains slow. Multiple --tag flags are OR'd; --skip-tag wins when a test matches both.
For agent tests the tag applies to every (agent, model) row the matrix expands to, so --tag smoke keeps or drops every model under that agent as a unit. See cli-reference.md for the full flag matrix.
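The --tag and --skip-tag semantics can be summarized in a short filter. The following Python sketch uses a hypothetical function name and represents each test as a dictionary with an optional tags list; it illustrates the rules stated above rather than the runner's own code.

```python
def select_tests(tests, tags=(), skip_tags=()):
    """Keep tests matching any --tag, unless they also match a --skip-tag."""
    wanted, skipped = set(tags), set(skip_tags)
    selected = []
    for test in tests:
        test_tags = set(test.get("tags", []))
        if test_tags & skipped:
            continue  # --skip-tag wins when a test matches both
        if wanted and not (test_tags & wanted):
            continue  # multiple --tag values are OR'd together
        selected.append(test)
    return selected
```

A test tagged ["smoke", "slow"] is therefore dropped by --tag smoke --skip-tag slow, and with no flags at all every test is kept.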
transform (optional, on tool tests)
Type: object with optional request and response command strings. Optional. Tool tests only.
A transform is a subprocess that rewrites the outbound request before it is sent, the response before assertions run, or both. The command reads one JSON value on stdin and writes the replacement value on stdout, so plain jq, node, or python work without any framing protocol.
tools:
  - name: search with normalized response
    server: prod-api
    tool: search
    args:
      query: "weather"
    transform:
      request: jq '.arguments.query |= ascii_downcase'
      response: ./transforms/strip-ids.sh
    expect:
      - target: result.content[0].text
        matcher:
          icontains: weather
The request command receives { "name": ..., "arguments": ... }; the response command receives { "result": ... }. A non-zero exit, a timeout, or unparseable stdout fails the test. You can set a default transform under defaultTest: and override it per test. See transforms.md for the full contract, the environment context, and worked examples in jq, Node, and Python.
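Because a transform only reads one JSON value and writes one JSON value, a Python version of a response transform can be very small. The sketch below is hypothetical: the idea of stripping id keys is borrowed from the strip-ids.sh name in the example, and returning the same { "result": ... } envelope that was received is an assumption (see transforms.md for the actual contract).

```python
import json


def strip_ids(value):
    """Recursively drop every "id" key from nested JSON data."""
    if isinstance(value, dict):
        return {key: strip_ids(item) for key, item in value.items() if key != "id"}
    if isinstance(value, list):
        return [strip_ids(item) for item in value]
    return value


def transform_text(raw):
    """Take the JSON text a transform receives on stdin, return the replacement."""
    return json.dumps(strip_ids(json.loads(raw)))


def run(stdin, stdout):
    """Entry point a script would call with sys.stdin and sys.stdout."""
    stdout.write(transform_text(stdin.read()))
```

Saved as a script that calls run(sys.stdin, sys.stdout), this could be referenced from response: in the same way as the shell script above.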
resources block
Type: array of objects, optional, default [].
Each entry is a resource-read test: read a single resource by URI from a named server (resources/read) and run assertions against the result envelope.
resources:
  - name: "readme resource is text"
    server: filesystem
    uri: "file:///workspace/README.md"
    expect:
      - target: "result.contents[0].mimeType"
        matcher:
          icontains: "text"
Fields per entry:
- name (string, required). Human-readable test name.
- server (string, required). Key into the top-level servers map.
- uri (string, required). Resource URI to read via resources/read.
- expect (array, optional). Assertions against the result. The result shape is { "contents": [{ "uri", "mimeType"?, "text" }] }, so targets read like result.contents[0].text.
- tags, threshold, and derivedMetrics work the same as on tool tests.
prompts block
Type: array of objects, optional, default [].
Each entry is a prompt-get test: fetch a single prompt by name from a named server (prompts/get) and run assertions against the result envelope.
prompts:
  - name: "bug-triage prompt renders"
    server: filesystem
    prompt: bug_triage
    args:
      severity: high
    expect:
      - target: "result.messages[0].content.text"
        matcher:
          icontains: "triage"
Fields per entry:
- name (string, required). Human-readable test name.
- server (string, required). Key into the top-level servers map.
- prompt (string, required). Prompt name to fetch via prompts/get.
- args (object, optional). JSON-serializable arguments passed to the prompt.
- expect (array, optional). Assertions against the result. The result shape is { "messages": [{ "role", "content": { "type", "text" } }] }, so targets read like result.messages[0].content.text.
- tags, threshold, and derivedMetrics work the same as on tool tests.

Both resource and prompt tests run against any MCP server, including the built-in mcptest mock, which serves resources/* and prompts/* from its manifest.
agents block
Type: array, optional. Each entry runs a real LLM against one or more MCP servers and asserts against the resulting conversation trace. Background and design rationale live in docs/models.md and docs/concepts.md.
agents:
  - name: weather query routes to get_weather
    # Either `model:` (singleton) or `models:` (matrix).
    models:
      - claude-sonnet-4-5  # auto-detect family
      - { provider: openrouter, id: openai/gpt-4o }  # named provider
    servers: [weather]  # one or more MCP servers
    prompt: What is the weather in Sacramento?
    system_prompt: |  # optional
      You are a weather assistant.
    max_turns: 10  # optional, default 10
    max_tokens: 1024  # optional, default 1024 (per call)
    token_budget: 50000  # optional, cumulative tokens across the run
    max_tool_calls: 20  # optional, total tool calls across the run
    time_budget: 30s  # optional, wall-clock deadline for the run
    expect:
      - target: tool_calls[0].name
        matcher: { exact: get_weather }
      - target: tool_names  # the whole trajectory, in order
        matcher: { contains-all: [get_weather] }
      - target: redundant_tool_calls  # no repeated (name, args) calls
        matcher: { exact: 0 }
      - target: conversation.tokens.total
        matcher: { regex: "^[0-9]+$" }
Per-entry fields:
| Field | Required | Purpose |
|---|---|---|
| name | yes | Test name shown in the reporter. |
| model or models | yes | Exactly one. model is sugar for models: [<value>]. Each entry under models: is either a bare string (auto-detected family) or { provider: <name>, id: <model> } referencing a providers: entry. |
| servers | yes | At least one server key from the top-level servers: map. The driver exposes the merged tool catalog from every named server to the model. |
| prompt | yes* | User prompt that kicks off the conversation. *Optional when a turns: block supplies the user side (see Multi-turn conversations). |
| turns | no | Drive a multi-turn conversation instead of a single prompt: a scripted list of { user: ... } turns, or a { simulate: ... } block where an LLM plays the user. See Multi-turn conversations. |
| system_prompt | no | Optional system prompt. |
| max_turns | no | Hard cap on tool-use iterations. Defaults to 10. |
| max_tokens | no | Per-call max_tokens for the model. Defaults to 1024. |
| token_budget | no | Cumulative token budget across the whole run. Distinct from per-call max_tokens: this sums every turn's usage and stops the run once the total crosses the cap. Omit for an unbounded run. |
| max_tool_calls | no | Cap on total tool invocations across the run. The run stops once the count crosses the cap. Omit for an unbounded run. |
| time_budget | no | Wall-clock deadline for the run as a human duration string (30s, 2m, 1h, 500ms). The run stops at the next loop boundary once the deadline passes. Omit for an unbounded run. |
| inject | no | Names of top-level faults to inject into this run. A call matching an injected fault is synthesized as an unresponsive timeout. |
| recovery | no | Recovery gate scored against the injected fault: max_detection_ms (required), max_recovery_ms (optional), require_clean_timeout (optional). See faults block. |
| expect | no | Assertions evaluated against the trace envelope (see below). |
The run-wide caps (token_budget, max_tool_calls, time_budget) all default to off, so a suite that omits them runs exactly as before. When a cap trips, the run stops with a typed error that names the cap and its limit, and the trace records the stopping cap on conversation.stop_reason (one of completed, turn_budget, token_budget, tool_call_budget, time_budget).
Matcher target grammar for agent runs:
| Target | What it resolves to |
|---|---|
| tool_calls[i].name | Bare tool name the model picked (the <server>__ prefix is stripped before the trace is recorded). |
| tool_calls[i].server | MCP server the call routed to. Always set, even on single-server runs. |
| tool_calls[i].args.<path> | Arbitrary JSON path into the arguments the model passed. |
| tool_calls[i].hop_index | Zero-based position of the call in the run. Assert a tool was called at a specific step, not just "at some point". |
| tool_calls[i].agent_id | Which agent made the call, in multi-agent runs. Absent (single-agent) keeps the legacy shape. |
| tool_calls[i].inputs_digest | Stable fingerprint of the context the agent had when it made the call. Assert it is unchanged across runs for a determinism check. |
| tool_calls.length | Total number of tool calls in the run. |
| tool_names | Ordered array of the tool names called, for asserting the whole trajectory in one matcher: exact: [search, fetch] pins the sequence, contains-all: [search] requires a tool, not: { contains: danger } forbids one. |
| redundant_tool_calls | Count of tool calls that repeat an identical (name, args) pair already made earlier in the run. A deterministic backtracking / efficiency signal: assert exact: 0 for a clean plan, or a ceiling. |
| tool_results[i].isError | Whether the server flagged the result as an error. |
| tool_results[i].server | Mirror of the matching tool_calls[i].server. |
| tool_results[i].result.<path> | Arbitrary JSON path into the server's response. |
| final_response | The model's last plain-text reply. Empty when the loop ran out of turns. |
| conversation.tokens.total, conversation.tokens.prompt, conversation.tokens.completion | Cumulative token usage reported by the provider. |
| conversation.duration_ms | Wall-clock duration of the agent loop. |
| conversation.message_count | Number of user / assistant / tool messages exchanged. |
| conversation.stop_reason | Why the run stopped: completed, turn_budget, token_budget, tool_call_budget, or time_budget. |
| conversation.per_turn[i].user, .final_response, .tool_calls[j].name, ... | Per-user-turn breakdown for a multi-turn run (see below). Empty on a single-prompt run. |
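To make the tool_names and redundant_tool_calls targets concrete, here is a Python sketch of how such values could be derived from a list of recorded calls. The function names are hypothetical, and treating two argument objects as identical when their key-sorted JSON serializations match is an assumption about what "identical (name, args) pair" means.

```python
import json


def tool_names(tool_calls):
    """Ordered list of the tool names called during the run."""
    return [call["name"] for call in tool_calls]


def redundant_tool_calls(tool_calls):
    """Count calls that repeat a (name, args) pair seen earlier in the run."""
    seen = set()
    redundant = 0
    for call in tool_calls:
        # Assumption: key order inside args does not affect identity.
        key = (call["name"], json.dumps(call.get("args", {}), sort_keys=True))
        if key in seen:
            redundant += 1
        else:
            seen.add(key)
    return redundant
```

A run that calls search with the same query twice and then fetch once would have tool_names of [search, search, fetch] and a redundant_tool_calls count of 1 under this sketch.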
Multi-turn conversations
By default an agent test drives one prompt through one tool-using loop. A turns: block drives a multi-turn conversation instead, carrying the conversation and tool state across turns. The top-level trace still aggregates every turn (so tool_calls, final_response, and the single-turn metrics resolve over the whole conversation), and each turn's slice is recorded under conversation.per_turn[i].
Scripted: an ordered list of user messages the driver replays.
agents:
  - name: multi-step trip planning
    model: claude-sonnet-4-5
    servers: [flights, hotels]
    turns:
      - user: "Find a flight SFO to JFK next Friday"
      - user: "Now book a hotel near JFK for that night"
    eval:
      multi_turn_mcp_use: { threshold: 0.6 }  # averages per-turn MCP use
Simulated: an LLM plays the user, generating each turn from a goal, an optional persona, and the conversation so far, until it answers DONE or hits max_turns (default 6). The simulator shares the agent's model, so with no API key it falls back to the deterministic stub and CI stays green.
turns:
  simulate:
    goal: "Book a same-day round trip and add it to my calendar"
    persona: "terse traveler who changes their mind once"
    max_turns: 6
With a turns: block, prompt is optional. Each turn's slice exposes conversation.per_turn[i].user, .final_response, .tool_calls[j], and .tool_results[j] (with call_index local to the turn), so an expect: can assert on a specific turn, e.g. target: conversation.per_turn[1].tool_calls[0].name.
eval (judged metrics)
An agent test can carry an eval: block of judged metrics that score the run trace after it completes. Each metric produces its own PASS/FAIL report row (alongside the per-run expect: rows) and gates the exit code. Metrics are judged by the agent's own model, so a run with no provider key defers them (reported, not failed) and CI stays green.
agents:
  - name: books a meeting end to end
    model: claude-sonnet-4-5
    servers: [calendar]
    prompt: "Book a meeting with Alice next Tuesday at 2pm and confirm it."
    eval:
      mcp_task_completion: { threshold: 0.6 }  # did the run accomplish the task?
      mcp_use: { threshold: 0.6 }  # right tools, right arguments?
      argument_correctness: {}  # tool-call arguments correct?
      plan_quality: {}  # sensible call sequence?
      rubric:  # your own weighted criteria
        threshold: 0.8
        criteria:
          - name: booked the right day
            weight: 2
            description: "The agent created an event on the correct Tuesday."
          - name: confirmed to the user
            description: "The final reply confirms the booking."
Metrics:
- mcp_task_completion (default threshold 0.5): did the run accomplish the task in the prompt?
- mcp_use (default 0.5): were the right tools called with sensible arguments for the intent?
- argument_correctness (default 0.5): are the arguments of each tool call correct and complete for the intent?
- plan_quality (default 0.5): is the sequence of tool calls a sensible plan (no redundancy, ordering problems, or gaps)?
- multi_turn_mcp_use (default 0.5): MCP-use alignment averaged across a multi-turn conversation. Each conversation.per_turn slice is judged for tool-use quality and the per-turn scores are averaged. Requires a turns: block (see above); on a single-prompt run it defers (reported, not failed).
- rubric (default threshold 0.7): your own custom rubric, in one of two forms. The flat form is a weighted, multi-criterion list: each criteria[] entry is name + description (the judge rubric) + an optional weight (default 1.0), and the score is the weight-normalized average of the per-criterion judgments, each reported with its own reason. Add strict: true to require a perfect score. Add require_evidence: true so each criterion's judge must cite a verbatim span from the candidate; a criterion with no evidence scores 0 and gates the eval. A criterion may carry examples: (labeled response + score pairs) that are appended to its judge prompt as few-shot calibration anchors. A criterion may also carry a when: predicate ({ contains: <str> } or { regex: <pattern> }) so it is judged only when the candidate matches, and its own threshold: that overrides the rubric default for that criterion's gate.
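The flat rubric's arithmetic can be sketched in a few lines. This is a minimal illustration, not mcptest's implementation: the Criterion class, rubric_score, and rubric_passes are hypothetical names, and the handling of a criterion with no evidence models only the "scores 0" part of the rule.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One judged criterion (hypothetical shape, for illustration only)."""
    name: str
    score: float              # per-criterion judge score in 0..1
    weight: float = 1.0       # default weight, as documented for criteria[]
    has_evidence: bool = True


def rubric_score(criteria, require_evidence=False):
    """Weight-normalized average of the per-criterion scores."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    weighted = 0.0
    for c in criteria:
        # Assumption: with require_evidence, a criterion that cites nothing scores 0.
        score = 0.0 if (require_evidence and not c.has_evidence) else c.score
        weighted += c.weight * score
    return weighted / total_weight


def rubric_passes(score, threshold=0.7, strict=False):
    """strict demands a perfect score; otherwise gate on the threshold."""
    return score >= (1.0 if strict else threshold)


criteria = [
    Criterion("booked the right day", score=1.0, weight=2),
    Criterion("confirmed to the user", score=0.5),
]
score = rubric_score(criteria)  # (2 * 1.0 + 1 * 0.5) / 3
```

With the weights above, the doubled criterion pulls the average toward its score, so one weak criterion does not sink the rubric unless strict is set.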
A rubric can instead be a decision tree: a branching set of yes/no judgments that ends at a fixed score. Each ask node poses one narrow question; the judge answers it, and the run descends the yes or no branch (a yes is a judge score of 0.5 or higher). A score leaf ends the walk with that score (0..1) and an optional reason. One narrow question per node is easier to judge reliably and to audit than one holistic score, and only the questions reach the model. The report shows the exact path taken.
eval:
  rubric:
    threshold: 0.7
    tree:
      ask: "Did the agent call the get_weather tool?"
      yes:
        ask: "Does the final reply state a temperature?"
        yes: { score: 1.0, reason: "called the tool and reported a temperature" }
        no: { score: 0.4, reason: "called the tool but gave no temperature" }
      no: { score: 0.0, reason: "never called the weather tool" }
A tree-mode failure note records the path, e.g. path: Did the agent call the get_weather tool? -> yes, Does the final reply state a temperature? -> no. Give a rubric either criteria or tree, not both.
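The tree walk described above can be sketched as a short loop. This is an illustrative sketch only: walk_tree and the judge callback are hypothetical, and the judge here is a lookup table standing in for a model call.

```python
def walk_tree(node, judge):
    """Descend ask nodes until a score leaf; return (score, reason, path).

    judge(question) returns a score in 0..1; 0.5 or higher counts as yes.
    """
    path = []
    while "ask" in node:
        question = node["ask"]
        branch = "yes" if judge(question) >= 0.5 else "no"
        path.append(f"{question} -> {branch}")
        node = node[branch]
    return node["score"], node.get("reason"), path


tree = {
    "ask": "Did the agent call the get_weather tool?",
    "yes": {
        "ask": "Does the final reply state a temperature?",
        "yes": {"score": 1.0, "reason": "called the tool and reported a temperature"},
        "no": {"score": 0.4, "reason": "called the tool but gave no temperature"},
    },
    "no": {"score": 0.0, "reason": "never called the weather tool"},
}

# Stand-in judge: fixed answers instead of a model call.
answers = {
    "Did the agent call the get_weather tool?": 0.9,
    "Does the final reply state a temperature?": 0.2,
}
score, reason, path = walk_tree(tree, answers.__getitem__)
```

Joining path with ", " reproduces the failure-note format quoted above.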
Every metric scores 0..1 and gates on its threshold. The score and the judge's reasons appear in the report, with secrets redacted.
Auto-detected provider families (set the env var for each model you want to exercise; missing keys fall back to a deterministic stub so CI stays green):
| Family | Detected when model starts with | Env var |
|---|---|---|
| Anthropic | claude- | ANTHROPIC_API_KEY |
| OpenAI | gpt-, chatgpt-, o<digit>, text-, davinci- | OPENAI_API_KEY (+ optional OPENAI_ORG_ID) |
| Google | gemini-, models/gemini- | GEMINI_API_KEY (or GOOGLE_API_KEY) |
| Mistral | mistral-, codestral-, magistral-, ministral-, devstral-, pixtral-, open-mistral-, open-mixtral- | MISTRAL_API_KEY |
For anything else (Azure OpenAI, OpenRouter, vLLM, LiteLLM, Groq, Together, etc.), see providers block.
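The prefix rules in the table can be read as a simple lookup. The sketch below is an assumption about how such detection could work, not the runner's code; detect_family and the lowercase family labels are hypothetical.

```python
import re

# Prefixes copied from the table above; family labels are illustrative.
FAMILIES = [
    ("anthropic", ("claude-",)),
    ("openai", ("gpt-", "chatgpt-", "text-", "davinci-")),
    ("google", ("gemini-", "models/gemini-")),
    ("mistral", (
        "mistral-", "codestral-", "magistral-", "ministral-",
        "devstral-", "pixtral-", "open-mistral-", "open-mixtral-",
    )),
]
O_SERIES = re.compile(r"o\d")  # the o<digit> rule, e.g. o3-mini


def detect_family(model):
    """Return the provider family for a model id, or None when nothing matches."""
    if O_SERIES.match(model):
        return "openai"
    for family, prefixes in FAMILIES:
        if model.startswith(prefixes):
            return family
    return None
```

A None result corresponds to the "anything else" case, which needs a named entry in the providers block.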
faults block
Type: array, optional. Declares named fault injections an agent test opts into with inject:. A fault makes a tool behave like an unresponsive backend so the agent's recovery path is exercised. The executor synthesizes the fault in virtual time (no real hang), so a recovery test is deterministic and CI-bounded.
| Field | Required | Purpose |
|---|---|---|
| name | yes | Unique fault name an agent test references via inject:. |
| target | yes | Which calls the fault applies to: { tool: <name> }, { server: <key> }, or both. At least one is required. A missing tool matches any tool; a missing server matches any server. |
| kind | yes | One of hang, wedged, slow, recover_after. |
| delay_ms | for slow | Milliseconds before the call answers. A delay at or beyond the run's max_detection_ms reads as unresponsive. |
| failures | for recover_after | Number of calls that hang before the server answers normally. |
The agent test that injects a fault gates the run with a recovery: block:
| Field | Required | Purpose |
|---|---|---|
| max_detection_ms | yes | Per-call timeout budget. A correctly configured agent gives up on a hung call within this rather than blocking forever. |
| max_recovery_ms | no | Cap on total recovery time. Omit to leave recovery time unbounded (the gate then only requires that the agent recovered at all). |
| require_clean_timeout | no | When true, also require a clean timeout within the detection budget. Defaults to false. |
faults:
  - name: hung-search
    target: { tool: search }
    kind: hang
agents:
  - name: recovers from a hung search tool
    model: claude-sonnet-4-5
    servers: [faulty]
    inject: [hung-search]
    prompt: Search for the latest incident report.
    recovery:
      max_detection_ms: 3000
      max_recovery_ms: 5000
      require_clean_timeout: true
A poor-recovery agent (one that loops on the hung tool until its turn budget trips, never replying) fails the run as a quality failure (exit 1). A dangling inject: name, or a connect failure, is an infra error (exit 2). See Fault injection and recovery for the scoring model.
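Two of the rules from the tables above are mechanical enough to sketch: how a target selects calls, and when a slow fault reads as unresponsive. Both functions are hypothetical helpers written for illustration; the executor's real logic may differ.

```python
def fault_applies(target, tool, server):
    """Return True when a fault's target selects this call.

    A missing tool matches any tool; a missing server matches any server.
    """
    if "tool" not in target and "server" not in target:
        raise ValueError("a fault target needs a tool, a server, or both")
    tool_ok = target.get("tool") in (None, tool)
    server_ok = target.get("server") in (None, server)
    return tool_ok and server_ok


def reads_as_unresponsive(delay_ms, max_detection_ms):
    """A slow fault at or beyond the detection budget looks like a hang."""
    return delay_ms >= max_detection_ms
```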
providers block
Type: object, optional. Declares custom OpenAI-compatible endpoints that agent models: entries can target by name. The wire shape is the OpenAI Chat Completions API, so any gateway or self-hosted server speaking that shape is reachable through one of these.
providers:
  openrouter:
    type: openai
    base_url: https://openrouter.ai/api/v1
    api_key_env: OPENROUTER_API_KEY
  azure-prod:
    type: openai
    base_url: https://my-resource.openai.azure.com/openai/deployments/my-gpt-5
    api_key_env: AZURE_OPENAI_KEY
    organization: my-azure-org     # optional
  local-vllm:
    type: openai
    base_url: http://localhost:8000/v1
    # api_key_env omitted: unauthenticated local endpoint. The runner
    # sends `Authorization: Bearer EMPTY`, which the common
    # self-hosted servers accept.
agents:
  - name: my test
    models:
      - { provider: openrouter, id: anthropic/claude-3.5-sonnet }
      - { provider: azure-prod, id: my-gpt-5 }
    ...
Per-entry fields:
| Field | Required | Purpose |
|---|---|---|
| type | yes | Wire protocol. Today the only supported value is openai. |
| base_url | yes | Endpoint base URL the runner POSTs to. |
| api_key_env | no | Environment variable that holds the bearer token. Omit for unauthenticated local endpoints. |
| organization | no | Optional OpenAI-Organization header. |
Worked example: agent-custom-providers.yml.
budget block
Type: object, optional. Per-test and per-suite spend caps applied to agent runs. Both fields are in USD cents; omitting a field disables the cap for that scope.
budget:
  per_test_usd_cents: 50
  per_suite_usd_cents: 500
| Field | Required | Purpose |
|---|---|---|
| per_test_usd_cents | no | Cap, in USD cents, on what one agent test can spend. The counter resets between tests. |
| per_suite_usd_cents | no | Cap, in USD cents, on what an entire suite can spend. Useful when a matrix run multiplies the per-test spend across N models. |
When a cap trips the agent loop stops with a clear error before the provider issues a surprise bill.
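The two scopes behave like two counters, one of which resets between tests. The sketch below is a hypothetical model of that bookkeeping (SpendTracker and BudgetExceeded are invented names), and it assumes a cap trips once spend exceeds it; whether the real check is strict or inclusive is not stated here.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a spend cap trips (hypothetical error type)."""


class SpendTracker:
    def __init__(self, per_test_usd_cents=None, per_suite_usd_cents=None):
        self.per_test = per_test_usd_cents    # None disables the per-test cap
        self.per_suite = per_suite_usd_cents  # None disables the per-suite cap
        self.test_spent = 0
        self.suite_spent = 0

    def start_test(self):
        # The per-test counter resets between tests; the suite counter does not.
        self.test_spent = 0

    def charge(self, cents):
        self.test_spent += cents
        self.suite_spent += cents
        if self.per_test is not None and self.test_spent > self.per_test:
            raise BudgetExceeded(f"test spent {self.test_spent} cents, cap {self.per_test}")
        if self.per_suite is not None and self.suite_spent > self.per_suite:
            raise BudgetExceeded(f"suite spent {self.suite_spent} cents, cap {self.per_suite}")


tracker = SpendTracker(per_test_usd_cents=50, per_suite_usd_cents=500)
tracker.start_test()
tracker.charge(30)
tracker.start_test()
tracker.charge(40)
```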
expect block (the matchers)
Type: array of assertions, optional, default [].
Each assertion combines a target (where to look in the response) with a matcher (how to judge it) and an optional message (human-friendly description used in reporter output).
expect:
  - target: "result.content[0].text"
    matcher:
      contains: "hello"
    message: "tool should greet by name"
Fields per assertion:
- target (string, required, no default). JSONPath-style expression that extracts a value from the response. Example: result.content[0].text.
- matcher (object, required, no default). One matcher predicate. See the matchers documented below.
- message (string, optional, no default). Friendly description used in failure output. Falls back to the matcher type when omitted.
- transform (string, optional, no default). A command that rewrites the extracted value before this matcher compares it. The value at target goes to the command's stdin as JSON and the parsed stdout replaces it, so you can lowercase a string, strip a volatile id, or reshape one field without touching the rest of the response. Works on tool and agent tests. See transforms.md for the contract.
expect:
  - target: "result.name"
    transform: jq ascii_downcase
    matcher:
      exact: "echo"
A matcher object carries exactly one key. The valid keys in v1 are exact, contains, regex, schema, snapshot, llm-judge, llm-jury, contains-all, contains-any, icontains, starts-with, is-json, is-valid-tools-call, levenshtein, is-xml, is-sql, similar, cel, factuality, answer-relevance, context-faithfulness, and not. The matcher selection is mutually exclusive at the schema level (minProperties: 1, maxProperties: 1), so the loader rejects a matcher that tries to combine exact: true with regex: "foo".
Every failed assertion produces a structured failure with the following common fields:
{
"test_name": "echoes the input back",
"target": "result.content[0].text",
"matcher": "regex",
"message": "echo should round-trip the rendered message",
"expected": "^ping from acct_",
"actual": "ping from acct_demo"
}
Reporters render this structure into pretty text, JSON, or JUnit XML. The matcher-specific shape is documented per matcher below.
exact matcher
Strict structural equality. Compares the extracted value to the matcher argument using JSON equality rules (objects, arrays, primitives).
expect:
  - target: "result.content[0].text"
    matcher:
      exact: "hello, world"
Failure shape:
{
"matcher": "exact",
"expected": "hello, world",
"actual": "hello, world!",
"diff": "+!"
}
Use exact when you want a deep-equal match. Reach for contains or regex when the response embeds extra text you do not care about.
contains matcher
Containment, matched by the type of the value:
- Strings: case-sensitive substring. contains: "Sacramento" passes against "It is rainy in Sacramento.". Use icontains for the case-insensitive form and exact for full equality.
- Objects: subset. Every key in the matcher value must be present in the actual value and its value must itself satisfy contains (extra keys in the actual value are ignored).
- Arrays: multiset. Every element in the matcher value must have a distinct matching element in the actual array, in any order.
Non-string scalars (numbers, booleans, null) compare by equality.
Object subset example (passes even though the actual result has more keys):
expect:
  - target: "result"
    matcher:
      contains:
        isError: false
Array multiset example (passes when every listed element is present):
expect:
  - target: "result.tags"
    matcher:
      contains: ["urgent", "billing"]
Failure shape (object subset, missing key):
{
"matcher": "contains",
"path": "/isError",
"expected": false,
"note": "missing key `isError` in actual value"
}
See also: exact, icontains, regex.
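The type-directed rules above can be sketched as one recursive function. This is an illustration rather than the matcher's source: contains and _match_distinct are hypothetical, and it assumes that an array element "matches" when it itself satisfies contains, which the text does not spell out.

```python
def contains(actual, expected):
    """Containment by type: substring, object subset, array multiset, scalar equality."""
    if isinstance(expected, str):
        return isinstance(actual, str) and expected in actual
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and contains(actual[key], value)
            for key, value in expected.items()
        )
    if isinstance(expected, list):
        return isinstance(actual, list) and _match_distinct(actual, expected)
    # Non-string scalars. Booleans are compared by identity so that
    # Python's True == 1 does not leak into JSON semantics.
    if isinstance(expected, bool) or isinstance(actual, bool):
        return actual is expected
    return actual == expected


def _match_distinct(actual, expected, used=frozenset(), i=0):
    """Backtracking search: each expected element claims a distinct actual element."""
    if i == len(expected):
        return True
    return any(
        j not in used
        and contains(item, expected[i])
        and _match_distinct(actual, expected, used | {j}, i + 1)
        for j, item in enumerate(actual)
    )
```

The backtracking matters for the multiset rule: a greedy pass could hand an actual element to the wrong expected element and report a false miss.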
regex matcher
Tests whether the regex pattern matches anywhere in the stringified value. Uses the regex crate's default syntax.
expect:
  - target: "result.content[0].text"
    matcher:
      regex: "^ping from acct_"
Failure shape:
{
"matcher": "regex",
"expected": "^ping from acct_",
"actual": "pong from svc_demo",
"match": null
}
When the regex compiles but does not match, match is null. When the regex itself is invalid, the loader fails the entire suite at validation time with a pointer to the offending field.
schema matcher
Validates the value against an inline JSON Schema. Useful for asserting the shape of a structured tool response without pinning specific values.
expect:
  - target: "result"
    matcher:
      schema:
        type: object
        required: ["content", "isError"]
        properties:
          content:
            type: array
            minItems: 1
          isError:
            type: boolean
Failure shape:
{
"matcher": "schema",
"errors": [
{
"instance_path": "/isError",
"schema_path": "/properties/isError/type",
"message": "expected boolean, got string"
}
]
}
The schema runs through the same jsonschema crate the loader uses, so draft 2020-12 features such as if/then/else, prefixItems, unevaluatedProperties, oneOf/anyOf/allOf, and $defs with internal $ref work as documented. To assert that a value is an array with at least one element (the role length used to play), use type: array with minItems.
Security limits. Two guards run before validation, and a third bounds validation itself:
- External $ref URIs are refused at compile time. Any $ref value that does not start with # (a same-document fragment) fails the matcher with SchemaExternalRef. The validator never reaches the network.
- The schema's JSON nesting depth is capped (default 64). A schema deeper than the cap fails with SchemaTooDeep before compile.
- Validation iteration runs under a wall-clock budget (default 2 seconds). Exceeding the budget fails the matcher with SchemaValidationTimedOut.
Real-world MCP schemas sit far under these defaults, so a hand-authored suite never trips them. The limits exist to keep a pathological schema or a malicious server response from stalling the runner.
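The two pre-validation guards amount to a pair of tree walks over the schema document. The sketch below shows one way to express them; external_refs and nesting_depth are hypothetical helpers, and the way depth is counted (one level per object or array) is an assumption.

```python
def external_refs(schema):
    """Yield every $ref value that is not a same-document fragment."""
    if isinstance(schema, dict):
        for key, value in schema.items():
            if key == "$ref" and isinstance(value, str):
                if not value.startswith("#"):
                    yield value
            else:
                yield from external_refs(value)
    elif isinstance(schema, list):
        for item in schema:
            yield from external_refs(item)


def nesting_depth(value):
    """Count nested containers; scalars are depth 0 (assumed counting rule)."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((nesting_depth(v) for v in value), default=0)
    return 0


schema = {
    "type": "object",
    "properties": {
        "a": {"$ref": "#/$defs/x"},
        "b": {"$ref": "https://example.com/s.json"},
    },
}
bad_refs = list(external_refs(schema))
depth = nesting_depth(schema)
```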
See also: exact, contains, the oneOf / anyOf / allOf composition matchers below.
snapshot matcher
Records the extracted value to disk on the first run, then deep-compares the extracted value against the recorded copy on every subsequent run. A mismatch fails the test with a readable diff. The shorthand is a string (the snapshot key); the long form is an object with optional flags.
expect:
  - target: "result.content"
    matcher:
      snapshot: "lists-tools-content"
The key resolves to a file under the snapshots/ directory next to the suite YAML: snapshot: "lists-tools-content" writes and reads <suite_dir>/snapshots/lists-tools-content.json. Keys may be nested (snapshot: "tools/echo/baseline" becomes <suite_dir>/snapshots/tools/echo/baseline.json); intermediate directories are created on first write. A .json extension is appended automatically unless the key already carries an extension.
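The key-to-path rule can be sketched with pathlib. snapshot_path is a hypothetical helper, and "already carries an extension" is read here as "the last path component has a suffix", which is an assumption.

```python
from pathlib import Path, PurePosixPath


def snapshot_path(suite_dir, key):
    """Resolve a snapshot key to a file under <suite_dir>/snapshots/."""
    rel = PurePosixPath(key)
    if not rel.suffix:
        # No extension on the key: append .json.
        rel = rel.with_name(rel.name + ".json")
    return Path(suite_dir) / "snapshots" / rel


flat = snapshot_path("/suite", "lists-tools-content")
nested = snapshot_path("/suite", "tools/echo/baseline")
```

Creating the intermediate directories on first write would then be a matter of calling mkdir(parents=True, exist_ok=True) on the parent of the resolved path.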
On the first run (or when the snapshot file is missing) the matcher records the current value and passes. On later runs it loads the recorded value and compares; the test fails when they differ. To re-record after an intentional change, run with --update-snapshots (or -u), which rewrites every snapshot the run touches and passes. Review the resulting git diff before committing.
mcptest run --update-snapshots
--update-snapshots is refused when CI=true is set, so a CI job never silently rewrites a golden file. Pass --allow-update-in-ci to override.
Long form (per-test update override):
expect:
  - target: "result.content"
    matcher:
      snapshot:
        name: "lists-tools-content"
        update: false
- name (string). Snapshot key, resolved as above.
- update (boolean, default false). When true, this test's snapshot is re-recorded regardless of the global flag. Use sparingly, for a single fast-moving test.
Failure shape:
{
"matcher": "snapshot",
"note": "snapshot at snapshots/lists-tools-content.json did not match (run with --update-snapshots to refresh)",
"expected": "...recorded value...",
"actual": "...current value..."
}
See also: exact, schema, cassette and snapshot layout.
llm-judge matcher
Routes the candidate string through an LLM with a grading rubric. The matcher passes when the judge returns pass: true or a score at or above threshold. The judge model is invoked through the same provider lookup the agent driver uses (env-var auto-detection plus named providers), so any provider family configured for the suite is fair game. The literature grounding is in docs/research-references.md.
expect:
  - target: final_response
    matcher:
      llm-judge:
        rubric: |
          The response must mention Sacramento and at least one
          temperature number, and must not invent details the tool
          did not return.
        threshold: 0.8
        model: claude-sonnet-4-5    # optional override
Fields under llm-judge:
- rubric (string, required, no default). Grading rubric handed to the judge model. Multi-line is fine.
- threshold (number in [0, 1], optional, default 0.7). When the judge's reply carries a score field, the matcher passes when score >= threshold. The reply can also carry pass: true to short-circuit the score comparison.
- model (string, optional). Override the judge model id. When omitted the runner uses the same provider the executor was built with (see the with_provider hook on McpExecutor).
Failure note: the matcher's diff carries the score and the judge's one-sentence reason, so the reporter shows you why the judge said no.
llm-jury matcher
Like llm-judge, but runs N independent jurors and requires a quorum to pass. Useful when one judge model is itself the subject of the test (so you do not want it grading its own output) or when the assertion is high-stakes enough to want consensus. Reports inter-juror agreement (Krippendorff's alpha) alongside the verdict so a split jury is obvious.
expect:
  - target: final_response
    matcher:
      llm-jury:
        rubric: |
          Reply pass when the response (1) names Sacramento, (2) cites
          the temperature the tool returned, (3) is no longer than two
          sentences.
        jurors:
          - model: claude-sonnet-4-5
          - model: gpt-5
          - model: gemini-2.5-pro
        threshold: 0.7
        quorum: 0.66
Fields under llm-jury:
- rubric (string, required). Same shape as llm-judge.rubric.
- jurors (array, required, at least one entry). Each entry has a model: field; an optional per-juror threshold: overrides the jury-wide default.
- threshold (number, optional, default 0.7). Default per-juror score threshold; jurors whose score meets it count as a pass.
- quorum (number in [0, 1], optional, default 0.5). Fraction of eligible jurors that must pass for the jury verdict to be pass. 1.0 requires unanimous; 0.5 is simple majority.
Jurors that error out (network failure, malformed reply) are recorded as abstaining; they do not count for or against quorum. The reporter shows each juror's verdict and the aggregate pass_fraction.
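The quorum arithmetic, including abstentions, can be sketched as follows. jury_verdict is a hypothetical function, not the matcher's code; it assumes the quorum comparison is inclusive and that a jury with no eligible jurors fails, neither of which is stated above.

```python
def jury_verdict(jurors, threshold=0.7, quorum=0.5):
    """Return (passed, pass_fraction) for a list of juror results.

    Each juror is a dict with "score" (None when the juror abstained) and an
    optional per-juror "threshold" that overrides the jury-wide default.
    """
    eligible = [j for j in jurors if j.get("score") is not None]
    if not eligible:
        return False, 0.0  # assumption: an all-abstaining jury does not pass
    passes = sum(1 for j in eligible if j["score"] >= j.get("threshold", threshold))
    pass_fraction = passes / len(eligible)
    return pass_fraction >= quorum, pass_fraction


verdict, fraction = jury_verdict(
    [{"score": 0.9}, {"score": 0.75}, {"score": 0.4}],
    threshold=0.7,
    quorum=0.66,
)
```

Because abstainers leave the denominator, one failed network call can swing a three-juror panel from 2 of 3 to 1 of 2, which is worth keeping in mind when picking a quorum.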
Worked example: examples/agent-llm-judge.yml.
See also: contains, regex, agents block.
contains-all matcher
The string value must contain every listed substring, or the array value must contain every listed element. The argument is an array of needles.
expect:
  - target: "result.content[0].text"
    matcher:
      contains-all: ["order", "shipped", "tracking"]
contains-any matcher
The string value must contain at least one listed substring, or the array value must contain at least one listed element. An empty list never passes.
expect:
  - target: "result.content[0].text"
    matcher:
      contains-any: ["approved", "accepted"]
icontains matcher
Case-insensitive substring containment. The argument is a single string.
expect:
  - target: "result.content[0].text"
    matcher:
      icontains: "SUCCESS"
starts-with matcher
The string value must start with the given prefix.
expect:
  - target: "result.content[0].text"
    matcher:
      starts-with: "data:image/png;base64,"
is-json matcher
The string value must parse as JSON. Pass ~ (null) for a parse-only check, or an object with a schema key to also validate the parsed document against an inline JSON Schema (the same engine the schema matcher uses).
expect:
  - target: "result.content[0].text"
    matcher:
      is-json:
        schema:
          type: object
          required: ["id", "status"]
is-valid-tools-call matcher
The value must be a well-formed MCP tool call: an object with a string name and an arguments object. Pass ~ (null) for a shape-only check, or an object with a schema key to validate the call's arguments against an inline JSON Schema.
expect:
  - target: "tool_calls[0]"
    matcher:
      is-valid-tools-call:
        schema:
          type: object
          required: ["city"]
levenshtein matcher
The Levenshtein edit distance between the stringified value and the reference value must be at most max. Useful for "close enough" text comparisons without an embedding model. The distance counts insertions, deletions, and substitutions over Unicode scalar values.
expect:
  - target: "result.content[0].text"
    matcher:
      levenshtein:
        value: "the quick brown fox"
        max: 3
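For readers who want the distance itself spelled out, here is the textbook two-row dynamic program. It is a generic sketch of Levenshtein distance, not mcptest's code; Python strings iterate by code point, which lines up with counting edits over Unicode scalar values for ordinary text.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep when equal
            ))
        prev = cur
    return prev[-1]


distance = levenshtein("the quick brown fox", "the quick brown fix")
```

With max: 3 as in the example above, a reply one substitution away from the reference passes.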
is-xml matcher
The string value must parse as well-formed XML. Pass ~ (null) for a well-formedness check, or an object with a root key to also assert the document's root element name. The value is read to EOF with a streaming parser, so any malformed markup (an unclosed tag, mismatched element) fails the assertion.
expect:
  - target: "result.content[0].text"
    matcher:
      is-xml:
        root: "note"
is-sql matcher
The string value must parse as SQL. The value is handed to a real SQL parser using the permissive generic dialect, so it accepts the broadest set of statements rather than pinning one engine's grammar. Useful for a "text to SQL" tool whose output should at least be syntactically valid. The matcher takes no options; pass ~ (null).
expect:
  - target: "result.content[0].text"
    matcher:
      is-sql: ~
similar matcher
Embedding cosine similarity to a reference string. The matcher embeds both the reference value and the actual value with the named embedding model, then passes when the cosine similarity is at least threshold (a floor in 0..1). Use it when two answers can be worded differently but mean the same thing, where levenshtein (character distance) is too literal.
Like llm-judge, this matcher calls an external model, so it needs a configured provider with an embeddings endpoint (OpenAI text-embedding-3-*, Google, Mistral, or a local Ollama embedding model). Anthropic has no embeddings API, so pointing similar at it fails with a clear unsupported-feature error. Because it is async, similar is not composable under the not: wrapper.
expect:
  - target: "result.content[0].text"
    matcher:
      similar:
        value: "the order shipped and is on its way"
        threshold: 0.85
        model: "text-embedding-3-small"
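The comparison step is plain cosine similarity over two embedding vectors. The sketch below uses toy vectors in place of real embeddings (no model is called), and cosine_similarity is a hypothetical helper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


# Toy stand-ins for embeddings of the reference and the actual value.
reference = [1.0, 2.0, 3.0]
actual = [2.0, 4.0, 6.0]
passes = cosine_similarity(reference, actual) >= 0.85
```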
cel matcher
A CEL (Common Expression Language) boolean predicate over the value. Reach for it when the built-in matchers do not express the check you need and you want a deterministic rule rather than an LLM judge. The value at target is bound into the expression as the variable value, so the predicate reads the resolved target directly. Omit target (or set it to "") to bind the whole response envelope as value.
CEL is the same predicate language SBproxy uses, so one expression dialect carries across the gateway and the tester. It evaluates in-process with no model and no network call, so unlike llm-judge it is deterministic and composes under not:. The regex helper is available, so value.matches("^ok") works. The expression must return a boolean; a parse error, an evaluation error, or a non-boolean result fails the run as a suite bug rather than a failed assertion.
Keep the split with transform: in mind: jq under transform: reshapes data (one value in, one value out), while cel: is a boolean predicate. Do not reach for one where the other fits.
expect:
  - target: "result"
    matcher:
      cel: "value.content.size() > 0 && !value.isError"
factuality, answer-relevance, context-faithfulness matchers
Named model-graded matchers with a shared vocabulary for RAG and QA suites. Each is sugar over llm-judge: it carries one field and desugars to a judge call with a vetted, fixed rubric that embeds that field, so the judge plumbing, threshold gating, and calibration all carry over. Like llm-judge, each needs a configured provider, and each takes an optional threshold (default 0.7 on the judge's 0..1 score) and a model override.
- factuality judges whether the value is consistent with a reference.
- answer-relevance judges how directly the value answers a query.
- context-faithfulness judges whether every claim in the value is grounded in a context (it penalizes hallucination).
expect:
  - target: "result.content[0].text"
    matcher:
      factuality:
        reference: "Paris is the capital of France"
  - target: "result.content[0].text"
    matcher:
      answer-relevance:
        query: "What is the capital of France?"
        threshold: 0.8
  - target: "result.content[0].text"
    matcher:
      context-faithfulness:
        context: "France is a country in Europe. Its capital is Paris."
not matcher
Universal negation. Wraps another matcher and passes when that matcher fails (and fails when it passes). One not: wrapper composes over every deterministic matcher, so there is no per-type not-contains or not-regex variant to learn. Negation is limited to deterministic matchers; wrapping a stateful matcher (snapshot) or an async/LLM matcher (llm-judge, llm-jury, similar) is an evaluation error.
expect:
  - target: "result.content[0].text"
    matcher:
      not:
        contains: "error"
oneOf / anyOf / allOf composition matchers
JSON Schema 2020-12 ships three composition keywords. mcptest exposes the same composition at the matcher level so a suite can compose any matchers, not just schemas, across one extracted target.
oneOf: passes when exactly one inner matcher passes.
expect:
  - target: "result.content[0].text"
    matcher:
      oneOf:
        - exact: "ok"
        - exact: "ready"
anyOf: passes when at least one inner matcher passes.
expect:
  - target: "result.content[0].text"
    matcher:
      anyOf:
        - contains: "ok"
        - contains: "ready"
        - regex: "^accepted-\\d+$"
allOf: passes when every inner matcher passes. Useful for combining a positive predicate with a negative one without two separate assertions.
expect:
  - target: "result"
    matcher:
      allOf:
        - contains: { isError: false }
        - not:
            contains: { content: [{ text: "" }] }
The body must be a non-empty YAML sequence of matchers. An empty sequence is a load-time error.
Composition wraps only deterministic matchers, the same scope as not:. An inner snapshot, llm-judge, llm-jury, or similar is an evaluation error rather than a silent skip, because a composition that silently dropped a branch would change its truth value without the suite author knowing. Structural errors from inner matchers (a bad regex, a malformed schema) surface unchanged rather than being swallowed as a failing branch.
Compositions nest. allOf: [ anyOf: [ ... ], not: { ... } ] is legal and behaves the obvious way.
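The truth tables for the three keywords, plus nesting, fit in a small combinator sketch. Matchers are modeled as plain Python predicates; compose and negate are hypothetical names, and the sketch evaluates every branch rather than short-circuiting, which is a choice made here for clarity.

```python
def compose(kind, matchers):
    """Build a predicate from inner predicates under oneOf, anyOf, or allOf."""
    if kind not in ("oneOf", "anyOf", "allOf"):
        raise ValueError(f"unknown composition: {kind}")
    if not matchers:
        raise ValueError("composition body must be a non-empty sequence")

    def run(value):
        passed = sum(1 for matcher in matchers if matcher(value))
        if kind == "oneOf":
            return passed == 1
        if kind == "anyOf":
            return passed >= 1
        return passed == len(matchers)  # allOf

    return run


def negate(matcher):
    """The not: wrapper over a deterministic predicate."""
    return lambda value: not matcher(value)


# allOf: [ anyOf: [contains ok, contains ready], not: contains error ]
nested = compose("allOf", [
    compose("anyOf", [lambda s: "ok" in s, lambda s: "ready" in s]),
    negate(lambda s: "error" in s),
])
```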
Budgets and headers (not matchers)
Four expect-level fields look like matchers in casual conversation but live alongside matcher: inside the expect: block, not inside the matcher: object itself. They apply to the whole step, not to one extracted target. To use them, write the expect: block in its long form: an object with an assertions: array plus any of the fields below.
- max_duration_ms (integer). Per-step duration budget. Fails the step when wall-clock duration exceeds the budget. Cache hits skip the check.
- max_response_tokens (integer or object). Caps the response token count. Long form takes budget, tokenizer, mode, and image_cost. The schema enforces these field names; see schemas/v1.json for the exact shape.
- response_headers (object). For URL servers, asserts that named headers match a string, regex, schema, exists, or contains predicate.
- response_headers_absent (array of strings). For URL servers, asserts that the named headers are not present.
expect:
  assertions:
    - target: "result.content[0].text"
      matcher:
        contains: "hello"
  max_duration_ms: 500
  response_headers:
    content-type: "application/json"
  response_headers_absent: ["set-cookie"]
Naming history
An earlier draft of the schema used equals, length, and jsonpath. Those names were dropped ahead of v1.0; the canonical matcher keys are listed above. Configs that still use the old names fail schema validation with a clear "additional property not allowed" error.
Test styles
mcptest supports three test styles, written to feel familiar whether you think in matchers, in step-by-step protocol traces, or in raw JSON-RPC.
Flat (default)
The flat style is one test per array entry, one tool call per test. This is what every example above uses. It is the style 95 percent of suites should adopt.
tools:
  - name: "lists tools without error"
    server: filesystem
    tool: "list_directory"
    args:
      path: "/tmp"
    expect:
      - target: "result.content"
        matcher:
          schema:
            type: array
            minItems: 1
  - name: "reads a known fixture"
    server: filesystem
    tool: "read_file"
    args:
      path: "${fixture_path}"
    expect:
      - target: "result.content[0].text"
        matcher:
          contains: "hello"
Stepwise
Stepwise tests run multiple tool calls in order, sharing state between steps. The current schema captures stepwise behavior via separate tools: entries that share ${variables}. A dedicated steps: field is on the roadmap (tracked under a follow-up ticket).
For now, model a multi-step interaction by chaining flat tests against the same server and using a variables: entry to thread shared identifiers:
variables:
  resource_id:
    value: "res_abc123"
tools:
  - name: "step 1: creates a resource"
    server: remote_api
    tool: "create_resource"
    args:
      id: "${resource_id}"
    expect:
      - target: "result.isError"
        matcher:
          exact: false
  - name: "step 2: reads the resource back"
    server: remote_api
    tool: "read_resource"
    args:
      id: "${resource_id}"
    expect:
      - target: "result.content[0].text"
        matcher:
          contains: "${resource_id}"
  - name: "step 3: deletes the resource"
    server: remote_api
    tool: "delete_resource"
    args:
      id: "${resource_id}"
    expect:
      - target: "result.isError"
        matcher:
          exact: false
Raw JSON-RPC
Raw JSON-RPC tests sidestep the tool-call abstraction and assert against the literal protocol envelope. The v1 schema models this through compliance: checks. Use the check: field to name a built-in protocol sequence (for example, initialize or tools/list), then assert on the raw response.
compliance:
  - name: "negotiates capabilities on initialize"
    server: filesystem
    check: "initialize"
    expect:
      - target: "result.protocolVersion"
        matcher:
          regex: "^2\\d{3}-\\d{2}-\\d{2}$"
      - target: "result.capabilities"
        matcher:
          schema:
            type: object
            required: ["tools"]
  - name: "advertises required tools"
    server: filesystem
    check: "tools/list"
    expect:
      - target: "result.tools"
        matcher:
          schema:
            type: array
            minItems: 1
A wider raw-frame style (full request envelope under your control, response captured as a JSON value) is on the roadmap. The current built-in checks are documented in compliance block.
See also: tools block, compliance block.
Variable interpolation
mcptest interpolates ${name} references in any string field at load time (after schema validation, before the first test runs). The available forms:
- ${VAR} resolves VAR from the merged variable scope (variables: block plus dotenv plus process env). If VAR is unset, the reference resolves to the empty string and mcptest prints a warning listing every unresolved name (so a typo like ${ACCOUNT_I} does not pass silently). Set MCPTEST_STRICT_VARS=1 to turn that warning into a hard error, which is the recommended setting for CI.
- ${VAR:-default} resolves VAR if set; otherwise inserts the literal default. The default is a string, not another reference. Use ${VAR:-} (empty default) to declare a reference that is intentionally optional and silence the unset warning.
- ${VAR:?} resolves VAR if set; otherwise fails the run with a clear error pointing at the field that demanded the value. Use this when a missing variable should halt the suite, for example a required API token.
- ${capture:NAME} resolves a value captured from an earlier test's response. When the reference is the entire value of a field and the captured value is a JSON object or array, it is injected structurally (as a nested object or array), not as a quoted string, so a captured object can be reused directly as a later tool's argument. Mixed text such as "prefix ${capture:NAME}" keeps the stringified form.
- $VAR is the short form, equivalent to ${VAR}. Allowed for one-token references where the next character would not form a valid identifier.
- $$ is the literal-dollar escape. Use it when you need a literal $ in the rendered string.
The full precedence table lives at mcptest.sh/docs/secrets-and-variables.
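The string-level forms can be modeled with one regular expression and a substitution callback. This is a sketch of the documented behavior, not the loader's implementation: interpolate is a hypothetical function, scope is a plain dict standing in for the merged variable scope, the strict flag stands in for MCPTEST_STRICT_VARS=1, "set" is taken to mean "present in the scope", and ${capture:NAME} is deliberately left untouched.

```python
import re

_TOKEN = re.compile(r"""
    \$\$                                          # literal-dollar escape
  | \$\{ (?P<name>[A-Za-z_][A-Za-z0-9_]*)
         (?: (?P<op>:-|:\?) (?P<arg>[^}]*) )? \}  # ${VAR}, ${VAR:-default}, ${VAR:?}
  | \$ (?P<short>[A-Za-z_][A-Za-z0-9_]*)          # $VAR short form
""", re.VERBOSE)


def interpolate(text, scope, strict=False):
    """Return (rendered_text, unresolved_names) for one string field."""
    unresolved = []

    def replace(match):
        if match.group(0) == "$$":
            return "$"
        name = match.group("name") or match.group("short")
        if name in scope:
            return scope[name]
        op = match.group("op")
        if op == ":-":
            return match.group("arg")  # literal default, possibly empty
        if op == ":?" or strict:
            raise KeyError(f"required variable {name} is unset")
        unresolved.append(name)        # caller would print the warning
        return ""

    return _TOKEN.sub(replace, text), unresolved


scope = {"base_url": "https://api.example.com", "ACCOUNT_ID": "acct_1"}
rendered, missing = interpolate("${base_url}/health for $ACCOUNT_ID costs $$5", scope)
```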
A worked example combining all forms:
variables:
  base_url:
    value: "https://api.example.com"
tools:
  - name: "uses every interpolation form"
    server: remote_api
    tool: "echo"
    args:
      url: "${base_url}/health"
      token: "${MCPTEST_API_TOKEN:?}"
      label: "${ENV_LABEL:-staging}"
      short: "$ACCOUNT_ID"
      literal: "the price is $$5"
    expect:
      - target: "result.isError"
        matcher:
          exact: false
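The resolution rules above can be approximated in a few lines of Python. This is a minimal sketch, not mcptest's implementation: the function name interpolate, the scope dictionary, and the strict flag (standing in for MCPTEST_STRICT_VARS=1) are all hypothetical, and ${capture:NAME} references are deliberately left untouched.

```python
import re

# Hypothetical sketch of the documented forms; not mcptest's actual code.
_TOKEN = re.compile(
    r"\$\$"                                   # $$ literal-dollar escape
    r"|\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)"  # ${VAR ...
    r"(?:(?P<op>:-|:\?)(?P<arg>[^}]*))?\}"    # ... optional :-default or :? }
    r"|\$(?P<short>[A-Za-z_][A-Za-z0-9_]*)"   # $VAR short form
)


def interpolate(text, scope, strict=False):
    """Return (rendered_text, unresolved_names) for one string field."""
    unresolved = []

    def replace(match):
        if match.group(0) == "$$":
            return "$"
        name = match.group("name") or match.group("short")
        op = match.group("op")
        if name in scope:
            return scope[name]
        if op == ":-":
            return match.group("arg")  # literal default, possibly empty
        if op == ":?" or strict:
            raise ValueError(f"required variable {name!r} is not set")
        unresolved.append(name)  # reported as a warning by the caller
        return ""

    return _TOKEN.sub(replace, text), unresolved
```

With scope = {"base_url": "https://api.example.com"}, interpolate("${base_url}/health", scope) renders the full URL with no unresolved names, while interpolate("${ACCOUNT_I}", scope) renders an empty string and reports ACCOUNT_I as unresolved.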
bearer_token_env vs ${NAME}
These look similar and are not the same.
servers:
  api_a:
    url: "https://a.example.com"
    auth:
      bearer_token_env: "API_A_TOKEN"
  api_b:
    url: "https://b.example.com"
    auth:
      bearer_token_env: "${SECRET_VAR_NAME}"
For api_a, mcptest reads the env var named API_A_TOKEN directly and sends its value as a bearer header. The YAML never holds the secret.
For api_b, mcptest first interpolates ${SECRET_VAR_NAME} to find the name of the env var to read, then reads that env var. This is occasionally useful (rotating tokens with versioned names), but most suites should stick to the literal bearer_token_env: NAME form. Reporters know that the value read by bearer_token_env is a credential and redact it from logs. Plain ${NAME} interpolation gets no such treatment.
See also: variables block, auth block.
imports block
Type: array of strings, optional, default [].
Each entry is a relative or absolute path to another mcptest YAML file. The design calls for the loader to merge imports in order, with later imports overriding earlier ones and the current file overriding all of its imports. The intended shape:
imports:
  - "./shared/servers.yml"
  - "./shared/variables.yml"
tools:
  - name: "uses an imported server"
    server: shared_filesystem
    tool: "list_directory"
    args:
      path: "/tmp"
The full import implementation lands in a follow-up ticket. Today the loader recognizes the imports: array (the schema validates it) and prints a clear error if any path fails to resolve. Treat this section as a forward contract: write your suites against the documented shape, and the loader will start honoring imports when the implementation lands.
Cycle detection is part of the same follow-up. Until then, do not import a file that imports your file. The loader will reject cycles with a structured error once the feature ships.
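The override order and the cycle check can be sketched as follows. This is an illustration of the documented contract only, since the feature has not shipped: merge_in_order and find_cycle are hypothetical names, and the merge granularity (maps merged per entry, lists and scalars replaced wholesale) is an assumption this page does not specify.

```python
def merge_in_order(imported_docs, current_doc):
    """Merge parsed YAML documents: later imports win, the current file wins over all."""
    merged = {}
    for doc in [*imported_docs, current_doc]:
        for key, value in doc.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                # Assumption: maps such as servers and variables merge per entry.
                merged[key] = {**merged[key], **value}
            else:
                # Assumption: scalars and lists are replaced wholesale.
                merged[key] = value
    return merged


def find_cycle(path, imports_of, stack=()):
    """Return the import chain that loops back on itself, or None."""
    if path in stack:
        return [*stack[stack.index(path):], path]
    for child in imports_of.get(path, []):
        cycle = find_cycle(child, imports_of, (*stack, path))
        if cycle:
            return cycle
    return None
```

Here imports_of maps each file path to the paths listed in its imports: array; a file that imports a file that imports it back yields a chain such as ["a.yml", "b.yml", "a.yml"].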
See also: variables block, servers block.
compliance block
Type: array of objects, optional, default [].
Compliance tests assert that the server speaks the MCP protocol correctly: capability negotiation, error shapes, required method presence. The v1 surface is a small set of built-in checks. The full compliance corpus is on the roadmap and not yet wired up.
compliance:
  - name: "negotiates capabilities on initialize"
    server: filesystem
    check: "initialize"
    expect:
      - target: "result.protocolVersion"
        matcher:
          regex: "^2\\d{3}-\\d{2}-\\d{2}$"
Fields per entry:
- name (string, required, no default). Human-readable test name.
- server (string, required, no default). Key into the top-level servers map.
- check (string, required, no default). Compliance check identifier. v1 ships a small built-in set, with initialize and tools/list as the first two.
- expect (array of assertions, optional, default []). Same shape as the tools[].expect block; see expect block.
The intended shape for the full corpus is a curated set of named checks shipped with the binary. Each check encodes a known-good interaction (for example, initialize issues an initialize request, validates the response, captures the server capabilities) and exposes assertion targets into the captured frames. The shape of those targets is the same as for tool tests, so the matchers in this doc apply unchanged.
See also: expect block, servers block.
evals block
Type: array of objects, optional, default [].
Evaluations grade a response against a rubric and return a score. An entry names a server and a prompt, supplies a rubric, and gates on a threshold. The rubric is the same shape the agent-side eval.rubric matcher uses, so a free-form string, a weighted criteria list, or a decision tree all work here.
A rubric takes one of three forms:
evals:
  # 1. Free-form string: a single holistic judgment.
  - name: "summary stays on topic"
    server: remote_api
    prompt: "Summarize the latest deployment."
    rubric: "Answer must mention the service name and the release tag."
    threshold: 0.7
  # 2. Weighted criteria: the score is the weight-normalized average of the
  #    per-criterion judgments, each reported with its own reason.
  - name: "booking quality"
    server: calendar
    prompt: "Book a meeting with Alice next Tuesday at 2pm and confirm it."
    rubric:
      threshold: 0.8
      criteria:
        - name: booked the right day
          description: "Created an event on the correct Tuesday."
          weight: 2
        - name: confirmed to the user
          description: "The final reply confirms the booking."
  # 3. Decision tree: one yes/no question per node, ending at a fixed score.
  - name: "weather answered"
    server: weather
    prompt: "What is the weather in Paris?"
    rubric:
      threshold: 0.7
      tree:
        ask: "Does the answer state a temperature?"
        yes: { score: 1.0, reason: "reported a temperature" }
        no: { score: 0.0, reason: "no temperature" }
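The arithmetic behind forms 2 and 3 is small enough to sketch. The helpers below are hypothetical and only illustrate the scoring described in the comments above; in particular, treating a criterion without a weight as weight 1 is an assumption, and the per-criterion scores would come from the judge, not from this code.

```python
def weighted_score(judgments):
    """Weight-normalized average of (score, weight) pairs, each score in [0, 1]."""
    total_weight = sum(weight for _, weight in judgments)
    return sum(score * weight for score, weight in judgments) / total_weight


def walk_tree(node, answer):
    """Follow yes/no branches until a leaf with a fixed score is reached.

    `answer` is a callable that takes the node's question and returns a bool
    (a stand-in for the judge's yes/no verdict).
    """
    while "score" not in node:
        node = node["yes"] if answer(node["ask"]) else node["no"]
    return node["score"], node["reason"]
```

For the booking example, a judge that accepts the first criterion (weight 2) and rejects the second (assumed weight 1) yields weighted_score([(1.0, 2), (0.0, 1)]), roughly 0.67, which falls below the 0.8 threshold.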
Fields per entry:
- name (string, required, no default). Human-readable test name.
- server (string, required, no default). Key into the top-level servers map.
- prompt (string, required, no default). Prompt sent to the model.
- rubric (optional). One of: a free-form string; a structured rubric object (criteria or tree, never both, mirroring the eval.rubric matcher, carrying its own threshold and strict); a reference { ref: <name> } to a rubric in the top-level rubrics: map, with optional inline threshold / strict overrides; or a preset { preset: <name> } (one of helpfulness, groundedness, safety, format-adherence, conciseness) with optional overrides and appended criteria.
- response (string, optional, no default). A fixed candidate response to grade. When present, the rubric grades this text directly: key-free and reproducible, the CI-safe path. When absent, the eval's prompt runs as a tool-using agent against its server and the whole run is graded; this needs a resolved provider and a reachable server, and defers otherwise.
- threshold (number in [0, 1], optional, default 0.7). Pass threshold for the free-form string form. A score at or above the threshold passes.
- judge (object, optional). Per-eval judge configuration: a model override, an optional jury: { size, consensus }, or a panel: [model, ...] ensemble with aggregate (mean, median, or majority) and tie_break (pass / fail). A jury grades the rubric size times and passes on a consensus fraction; a panel grades once per model and combines the verdicts. A panel takes precedence over a jury. OSS juries and panels are single-provider. The run header prints the projected judge-call count.
- matrix (object, optional). Fan the eval out across models and/or prompts: one cell per combination, each reusing the same rubric, judge, and threshold so the per-cell scores compare apples to apples. Each cell is its own report row, so the comparison renders in every reporter.
Evaluations run via mcptest eval (the dedicated subcommand). The default mcptest run invocation skips them so a basic CI gate stays cheap. The judge model is resolved from the environment; without an API key (or without a response to grade) each eval defers, reported as passed with a note, so a key-free CI run stays green. See Rubric scoring for the full guide.
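The jury and panel rules from the judge field can be read as the following sketch. It is a guess at the combining logic, not the shipped code: jury_passes and panel_passes are hypothetical names, and the assumptions are that a jury counts a run as passing when its score reaches the threshold, that mean and median compare the aggregated score to the threshold, and that tie_break only applies to an evenly split majority vote.

```python
from statistics import mean, median


def jury_passes(scores, threshold, consensus):
    """Pass when the fraction of passing runs reaches the consensus fraction."""
    passing = sum(score >= threshold for score in scores)
    return passing / len(scores) >= consensus


def panel_passes(scores, threshold, aggregate="mean", tie_break="fail"):
    """Combine one score per panel model into a single pass/fail verdict."""
    if aggregate == "mean":
        return mean(scores) >= threshold
    if aggregate == "median":
        return median(scores) >= threshold
    if aggregate == "majority":
        passing = sum(score >= threshold for score in scores)
        failing = len(scores) - passing
        if passing == failing:
            return tie_break == "pass"  # assumed: tie_break decides even splits
        return passing > failing
    raise ValueError(f"unknown aggregate {aggregate!r}")
```

Under these assumptions a jury of three with scores 0.9, 0.8, and 0.4 passes a 0.7 threshold at a consensus of 0.6, because two of the three runs pass.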
scorers block
Type: array of objects, optional, default []. Top-level key (a sibling of servers: and tools:, not nested under evals:).
A scorer is an external grader that returns the same {verdict, score, reasoning, cost_usd} envelope as the built-in judge, so reporters, --max-cost, and verdict caching stay uniform no matter who scored. The build ships the exec scorer; vendor types (braintrust, langsmith, ...) are not included.
scorers:
  - name: factuality
    type: exec
    command: ["python3", "examples/scorers/braintrust.py"]
    env:
      BRAINTRUST_API_KEY: ${BRAINTRUST_API_KEY}
    timeout_ms: 30000
Fields per entry:
- name (string, required). Folded into the verdict cache key.
- type (string, required). Open-vocabulary. exec ships in the build; an unknown type parses fine and only fails at resolution with a "scorer type 'X' not available in this build" diagnostic, so the config format is the same whether or not a given scorer type is available.
- command (array, required for exec). Argv; the first element is the executable. mcptest writes {input, output, criteria, metadata} as one JSON document to the process stdin and reads {verdict, score, reasoning, cost_usd} from stdout.
- env (object, optional). Overlay for the subprocess. ${VAR} references resolve against the process environment at parse time.
- cwd (string, optional). Working directory; inherits the parent cwd when omitted.
- timeout_ms (integer, optional, default 30000). Per-call timeout.
A cost_usd: null response is accepted (external scorers often cannot compute cost) and does not count toward --max-cost. See the worked walkthrough and reference wrappers in external scorers.
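A minimal exec scorer that follows the stdin/stdout contract above might look like the sketch below. It is illustrative only: the grading rule (pass when the output is non-empty) is a toy, the "pass" / "fail" verdict vocabulary is an assumption this page does not confirm, and grade and main are hypothetical names.

```python
import json
import sys


def grade(request):
    """Toy grader over the {input, output, criteria, metadata} document."""
    output = str(request.get("output", ""))
    passed = bool(output.strip())
    return {
        "verdict": "pass" if passed else "fail",  # assumed verdict vocabulary
        "score": 1.0 if passed else 0.0,
        "reasoning": "output is non-empty" if passed else "output is empty",
        "cost_usd": None,  # serialized as null; this scorer cannot compute cost
    }


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read one JSON document from stdin and write the envelope to stdout."""
    request = json.load(stdin)
    json.dump(grade(request), stdout)


# A real wrapper script would call main() here, under the usual
# `if __name__ == "__main__":` guard.
```

Keeping grade separate from main lets you unit-test the grading rule without spawning a subprocess.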
See also: Research references, External scorers.
model_compatibility block
Type: array of strings, optional, default [].
Pure metadata. Each entry is a model identifier the test suite targets. Reporters use this list to label results when a run is associated with a specific model.
model_compatibility:
- "claude-opus-4-7"
- "claude-sonnet-4-5"
- "gpt-4o"
The full model-compatibility matrix (cross-model pass-fail grids, diff-against-baseline output) is the W8 milestone. v1 only records the list and surfaces it to reporters.
performance block
Type: object, optional, no default fields.
Time and duration units
mcptest uses two time conventions, and the field name tells you which:
- Fields whose name ends in _ms (default_timeout_ms, timeout_ms, delay_ms, max_duration_ms, p95_latency_ms) take a plain integer number of milliseconds. default_timeout_ms: 30 is 30 milliseconds, not 30 seconds.
- Fields named timeout, connect_timeout, and time_budget take a duration string with a unit suffix: ms, s, m, or h (for example 30s, 2m, 500ms).
When in doubt, prefer the duration-string form where the schema allows it, since the unit is explicit. A bare integer in a duration-string field is rejected at load time.
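The duration-string convention can be sketched as a small parser. This is a hypothetical helper, not the loader's code; it assumes whole-number quantities only (whether fractional values such as 1.5s are accepted is not stated on this page) and rejects bare integers as described above.

```python
import re

_UNITS_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000}
_DURATION = re.compile(r"(\d+)(ms|s|m|h)")


def duration_to_ms(value):
    """Convert a duration string such as '30s' or '500ms' to milliseconds."""
    match = _DURATION.fullmatch(value) if isinstance(value, str) else None
    if match is None:
        # Covers bare integers (30 or "30") and unknown suffixes alike.
        raise ValueError(f"expected a duration like '30s' or '500ms', got {value!r}")
    return int(match.group(1)) * _UNITS_MS[match.group(2)]
```

For example, duration_to_ms("2m") gives 120000, while duration_to_ms(30) raises an error because the unit is missing.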
performance holds two soft budgets that apply suite-wide.
performance:
  default_timeout_ms: 10000
  p95_latency_ms: 2000
Fields:
- default_timeout_ms (integer at least 1, optional, no default). Default per-test timeout in milliseconds. Individual tools[].timeout_ms overrides this value.
- p95_latency_ms (integer at least 1, optional, no default). Soft p95 (95th-percentile) latency budget. Reporters highlight tests that breach the budget; the run still passes unless a test-level timeout fires.
Deeper performance work (load generation, sustained throughput, percentile-over-time output) lives behind a separate mcptest bench command and is out of scope for v1.
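For readers unfamiliar with percentile budgets, the sketch below shows one common way to compute a p95 and compare it to p95_latency_ms. The nearest-rank method and the function names are assumptions; this page does not say which percentile definition or sample set the reporters use.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def breaches_budget(latencies_ms, budget_ms):
    """True when the observed p95 exceeds the soft budget."""
    return p95(latencies_ms) > budget_ms
```

Because the budget is soft, a True result here would only highlight the test in the report; it would not fail the run.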
See also: tools block.
Profiles and baseline
Two practical knobs help large suites stay green over time. Both surfaces are stable on the CLI side; the YAML-side handles are in flight.
profile: keyword (in flight)
A profile: keyword on individual tests selects which test groups run for a given invocation. The three named profiles in flight are:
- quick. Smoke tests only. Aim for under five seconds wall time.
- standard. The default. Everything except the slow-and-expensive set.
- full. Every test in the file, including evals and any slow integration paths.
The intended shape at the test-entry level:
tools:
  - name: "fast smoke"
    server: local
    tool: "ping"
    profile: "quick"
    expect:
      - target: "result.isError"
        matcher:
          exact: false
  - name: "expensive scenario"
    server: local
    tool: "stress"
    profile: "full"
    args:
      payload_kb: 1024
The CLI side already accepts --profile. The YAML schema entry for profile: is deferred (tracked under a follow-up ticket). Until it lands, filter via --include / --exclude test name patterns on the CLI.
Baselining known failures
There is no run-level expected-failures baseline yet: mcptest run does not take a --baseline flag. The baseline that ships today is the compliance baseline, a list of compliance rule IDs a server is known to fail, consumed by mcptest compliance run --baseline. A new rule failure flips the gate red; a baselined rule that starts passing is flagged so you can trim the file. See compliance-baseline.md for the file shape and the CI workflow.
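The gate logic described above reduces to two set differences. The sketch below is illustrative: compare_to_baseline and the rule IDs in the usage note are hypothetical, and it assumes that any baselined rule absent from the failure list passed in this run.

```python
def compare_to_baseline(failed_rules, baseline):
    """Classify compliance results against a baseline of known failures."""
    failed, known = set(failed_rules), set(baseline)
    new_failures = sorted(failed - known)   # these flip the gate red
    stale_entries = sorted(known - failed)  # baselined rules that now pass
    return {
        "gate_ok": not new_failures,
        "new_failures": new_failures,
        "stale_entries": stale_entries,
    }
```

With a baseline of ["RULE-A", "RULE-B"] and a run that fails only RULE-B, the gate stays green and RULE-A is reported as a stale entry you can trim from the file.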
Named error scenarios
A fixtures block at the top of the file declares reusable error shapes once, then any tool test can trigger one by name via inject_error:. The runner short-circuits the live dispatch path for these tests: when a tool entry sets inject_error: <name>, the executor synthesizes the {"result": {"jsonrpc":"2.0","error":{...}}} envelope from the named fixture and feeds it through the assertion pipeline as if the server had returned it. No request crosses the wire, no transport is spawned, no cassette is written. This lets a suite exercise the same failure paths (rate limits, expired tokens, 5xx blips) across many tool calls without standing up a stub server.
fixtures:
  server: mock_github
  errors:
    - name: rate_limited
      code: -32000
      message: "GitHub API rate limit exceeded"
      tool: create_issue
    - name: auth_expired
      code: -32001
      message: "OAuth token expired"
      applies_to: any
tools:
  - name: "handles rate limiting gracefully"
    server: mock_github
    tool: create_issue
    inject_error: rate_limited
    args:
      title: "Triage queue overflow"
    expect:
      - target: "result.isError"
        matcher:
          exact: true
Fields under fixtures:
- server (string, optional, no default). Key into the top-level servers map identifying which server the fixtures attach to. Metadata only in v1; reporters use it for context.
- errors (array of objects, optional, default []). Named error scenarios. Each entry is shaped like:
  - name (string, required, no default). Unique identifier referenced by inject_error: from a tool test. Must be unique within the file.
  - code (integer, required, no default). JSON-RPC error code returned in place of a successful tool result.
  - message (string, required, no default). Human-readable error message echoed back to the caller.
  - tool (string, one-of, no default). Scope the error to a single tool name. Mutually exclusive with applies_to.
  - applies_to ("any", one-of, no default). Allow the error to be injected for any tool on the server. Mutually exclusive with tool.
Exactly one of tool or applies_to must be present. The schema rejects entries that set both or neither.
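The constraints on fixtures.errors[] can be expressed as a short check. This is a sketch of the rules as documented (unique names, exactly one of tool or applies_to), not the schema or loader itself; the function names are hypothetical, and can_inject assumes that a tool-scoped fixture may only be injected into tests for that tool.

```python
def validate_error_fixtures(errors):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    seen = set()
    for index, entry in enumerate(errors):
        name = entry.get("name")
        if name in seen:
            problems.append(f"errors[{index}]: duplicate name {name!r}")
        seen.add(name)
        # Both present or both absent violates the one-of rule.
        if ("tool" in entry) == ("applies_to" in entry):
            problems.append(f"errors[{index}]: set exactly one of tool or applies_to")
    return problems


def can_inject(entry, tool_name):
    """Assumed scoping rule: 'any' matches every tool, otherwise names must match."""
    return entry.get("applies_to") == "any" or entry.get("tool") == tool_name
```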
On the test side, inject_error: on a tool entry names a fixture:
- inject_error (string, optional, no default). The name of an entry under fixtures.errors[]. The schema does not enforce that the name resolves to a declared fixture; that cross-reference check is the loader's job and produces the user-friendly error message when a name is misspelled.
The synthesized envelope matches the shape a real MCP server returns for a JSON-RPC error: a top-level result wrapper (so dotted assertion paths rooted at result. resolve), with jsonrpc: "2.0" and an error object that carries code and message. Authoring an assertion against an injected fixture works exactly like asserting against a real failure:
expect:
  - target: "result.error.code"
    matcher:
      exact: -32000
  - target: "result.error.message"
    matcher:
      icontains: "rate limit"
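To make the envelope shape concrete, the sketch below builds it from a fixture and resolves the two assertion targets used above. Both helpers are hypothetical; the path resolver handles plain dotted keys only (not index forms such as content[0]), and reading icontains as a case-insensitive substring check is an assumption based on its name.

```python
def synthesize_envelope(fixture):
    """Build the documented {"result": {"jsonrpc": ..., "error": ...}} envelope."""
    return {
        "result": {
            "jsonrpc": "2.0",
            "error": {"code": fixture["code"], "message": fixture["message"]},
        }
    }


def resolve_target(envelope, target):
    """Walk a dotted path such as 'result.error.code' through nested dicts."""
    value = envelope
    for part in target.split("."):
        value = value[part]
    return value


fixture = {"name": "rate_limited", "code": -32000,
           "message": "GitHub API rate limit exceeded"}
envelope = synthesize_envelope(fixture)
code = resolve_target(envelope, "result.error.code")
message = resolve_target(envelope, "result.error.message")
matches = "rate limit" in message.lower()  # assumed icontains semantics
```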
Worked examples ship at examples/named-errors-stdio.yml and examples/named-errors-url.yml.
Compositions
A compositions: block declares a directed acyclic graph (DAG) of tool calls that runs in topological order. Each node names its parents in needs:, and a ${id.field} reference whose id is not in needs: is a load-time error. See compositions.md for the full reference and worked example.
compositions:
  - name: render-top-readme
    nodes:
      - id: search
        tool: search_packages
        args: { query: "mcp" }
      - id: readme
        needs: [search]
        tool: get_readme
        args:
          package: "${search.top_hit.name}"
    output: readme
    expect:
      - target: "nodes.readme.status"
        matcher:
          exact: "ok"
Assertion targets are addressable: composition.ran, composition.order, nodes.<id>.status, nodes.<id>.output, nodes.<id>.duration_ms, and output.* (the combined output from the output: node, or the last node in topological order when output: is omitted).
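The ordering and the needs: check can be sketched with Python's standard graphlib module. This is an illustration of the rules stated above, not the runner's implementation; plan and _strings are hypothetical names, and the reference pattern only recognizes the ${id.field} form shown on this page.

```python
import re
from graphlib import TopologicalSorter

_NODE_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\.")


def _strings(value):
    """Yield every string nested inside an args value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from _strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from _strings(item)


def plan(nodes):
    """Validate ${id.field} references against needs:, then return a run order."""
    graph = {}
    for node in nodes:
        needs = node.get("needs", [])
        for text in _strings(node.get("args", {})):
            for ref in _NODE_REF.findall(text):
                if ref not in needs:
                    raise ValueError(
                        f"node {node['id']!r} references {ref!r} without listing it in needs"
                    )
        graph[node["id"]] = set(needs)
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(graph).static_order())
```

For the render-top-readme composition above, plan returns the search node before the readme node; removing needs: [search] from readme turns the ${search.top_hit.name} reference into an error.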
Cassette references
Cassettes record live MCP traffic to disk so a test can replay it deterministically later. The v1 cassette implementation is in flight in the mcptest-cassette crate. The YAML surface (a cassette: field on a tool test that points to a captured frame set) will be documented in a separate reference once the implementation stabilizes.
Until cassettes ship, run against the live server. The schema does not currently validate a cassette: key; do not add one yet, the loader will reject it as unknown.
Putting it all together
A larger example exercising most of the v1 surface:
# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json
servers:
  filesystem:
    command: ["npx", "-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
    env:
      LOG_LEVEL: "info"
  remote_api:
    url: "https://mcp.example.com/v1"
    auth:
      bearer_token_env: "MCPTEST_REMOTE_API_TOKEN"
    wait_for_ready: "https://mcp.example.com/healthz"
variables:
  fixture_path:
    value: "/tmp/mcptest-fixture.txt"
  account_id:
    from_env: "MCPTEST_ACCOUNT_ID"
    default: "acct_demo"
tools:
  - name: "lists tools without error"
    server: filesystem
    tool: "list_directory"
    args:
      path: "/tmp"
    expect:
      - target: "result.content"
        matcher:
          schema:
            type: array
            minItems: 1
        message: "directory listing should not be empty"
  - name: "reads a known fixture"
    server: filesystem
    tool: "read_file"
    args:
      path: "${fixture_path}"
    expect:
      - target: "result.content[0].text"
        matcher:
          contains: "hello"
  - name: "echoes the input back"
    server: remote_api
    tool: "echo"
    args:
      message: "ping from ${account_id}"
    expect:
      - target: "result.content[0].text"
        matcher:
          regex: "^ping from acct_"
compliance:
  - name: "negotiates capabilities on initialize"
    server: filesystem
    check: "initialize"
  - name: "advertises required tools"
    server: remote_api
    check: "tools/list"
    expect:
      - target: "result.tools"
        matcher:
          schema:
            type: array
            minItems: 1
evals:
  - name: "summary stays on topic"
    server: remote_api
    prompt: "Summarize the latest deployment."
    rubric: "Answer must mention the service name and the release tag."
    threshold: 0.7
model_compatibility:
  - "claude-opus-4-7"
  - "claude-sonnet-4-5"
performance:
  default_timeout_ms: 15000
  p95_latency_ms: 2000
This file:
- Defines one stdio server (filesystem) and one URL server (remote_api) with bearer-token auth and a readiness probe.
- Declares two variables, one literal and one env-backed with a default.
- Runs three tool tests covering the schema, contains, and regex matchers, with a mix of literal and interpolated arguments.
- Runs two compliance checks, one with no extra assertions and one with a schema assertion that requires at least one tool entry.
- Declares one evaluation (skipped by mcptest run, executed by mcptest eval).
- Labels the suite with two model identifiers.
- Sets a 15-second default timeout and a 2-second p95 latency budget.
Run it with:
mcptest validate path/to/file.yml
mcptest run path/to/file.yml
validate runs schema validation only. run validates first, then executes the tests in the file, skipping evals (which run under mcptest eval).
Field index
A quick cross-reference, grouped alphabetically by top-level key.
| Path | Type | Required | Default | Section |
|---|---|---|---|---|
| compliance[].check | string | yes | - | compliance |
| compliance[].expect[] | array | no | [] | expect |
| compliance[].name | string | yes | - | compliance |
| compliance[].server | string | yes | - | compliance |
| evals[].name | string | yes | - | evals |
| evals[].prompt | string | yes | - | evals |
| evals[].rubric | string or object | no | - | evals |
| evals[].server | string | yes | - | evals |
| evals[].threshold | number | no | 0.7 | evals |
| fixtures.server | string | no | - | Named error scenarios |
| fixtures.errors[].name | string | yes | - | Named error scenarios |
| fixtures.errors[].code | integer | yes | - | Named error scenarios |
| fixtures.errors[].message | string | yes | - | Named error scenarios |
| fixtures.errors[].tool | string | yes (one of) | - | Named error scenarios |
| fixtures.errors[].applies_to | "any" | yes (one of) | - | Named error scenarios |
| imports[] | string | no | [] | imports |
| model_compatibility[] | string | no | [] | model_compatibility |
| performance.default_timeout_ms | integer | no | - | performance |
| performance.p95_latency_ms | integer | no | - | performance |
| servers.<name>.command | array | yes (stdio) | - | servers |
| servers.<name>.env | object | no | {} | servers |
| servers.<name>.url | string | yes (url) | - | servers |
| servers.<name>.auth.bearer_token_env | string | yes (bearer) | - | auth |
| servers.<name>.auth.oauth.client_id_env | string | yes (oauth) | - | auth |
| servers.<name>.auth.oauth.authorization_url | string | yes (oauth) | - | auth |
| servers.<name>.auth.oauth.token_url | string | yes (oauth) | - | auth |
| servers.<name>.auth.oauth.scopes | array | no | [] | auth |
| servers.<name>.wait_for_ready | string | no | - | servers |
| tools[].args | object | no | {} | tools |
| tools[].expect[] | array | no | [] | expect |
| tools[].expect[].matcher.contains | any | yes (one of) | - | contains |
| tools[].expect[].matcher.exact | any | yes (one of) | - | exact |
| tools[].expect[].matcher.llm-judge | object | yes (one of) | - | llm-judge |
| tools[].expect[].matcher.regex | string | yes (one of) | - | regex |
| tools[].expect[].matcher.schema | object | yes (one of) | - | schema |
| tools[].expect[].matcher.snapshot | string or object | yes (one of) | - | snapshot |
| tools[].expect[].message | string | no | - | expect |
| tools[].expect[].target | string | yes | - | expect |
| tools[].inject_error | string | no | - | Named error scenarios |
| tools[].name | string | yes | - | tools |
| tools[].server | string | yes | - | tools |
| tools[].timeout_ms | integer | no | - | tools |
| tools[].tool | string | yes | - | tools |
| variables.<name>.default | string | no | - | variables |
| variables.<name>.from_env | string | no (one of) | - | variables |
| variables.<name>.value | scalar | no (one of) | - | variables |
A field marked "yes (one of)" is required only when its siblings are absent, per the oneOf constraints in the schema.
Deferred and in-flight summary
For quick reference, the status of every field or feature this page flags as deferred, in flight, or only recently wired up:
| Field or feature | Status |
|---|---|
| servers.<name>.auth.headers | deferred to a future release |
| servers.<name>.auth.mtls | deferred to a future release |
| servers.<name>.headers | schema accepts, runner wires up |
| servers.<name>.http | schema accepts, runner wires up |
| servers.<name>.http.mtls | deferred to a future release |
| tools[].data (parametric inputs) | in flight |
| tools[].expect[].max_duration_ms | schema accepts, runner wires up |
| tools[].expect[].max_response_tokens | schema accepts, runner wires up |
| tools[].expect[].response_headers | schema accepts, runner wires up |
| tools[].expect[].response_headers_absent | schema accepts, runner wires up |
| tools[].profile | in flight |
| tools[].cassette | in flight (mcptest-cassette crate) |
| fixtures.errors[] model | shipped |
| tools[].inject_error injection | shipped |
| Compliance corpus expansion | in flight |
| Container-managed servers | a future release |
| Full eval grader | in flight |
| Baseline expected-failures file | in flight |
If you find a field this page does not cover that the schema does validate, that is a doc bug; file an issue against the mcptest project.
See also
- schemas/v1.json, the authoritative source of truth.
- examples/server-stdio.yml, worked stdio example.
- examples/server-url.yml, worked URL example with bearer auth.
- docs/research-references.md, the literature grounding for evals and the judge matcher.