
Fault taxonomy and execution safety

Three concerns sit at the runtime edge of robustness testing: which fault families a suite actually exercised, the safety policy that governs which tool calls a synthesizing feature may make, and whether a server stays correct when several clients talk to it at once. This page covers all three.

Runtime fault taxonomy coverage

A passing test suite tells you what your MCP server does right. It does not tell you which failure modes you actually checked for. MCP servers fail at runtime in recurring families: protocol-lifecycle slips, tool-invocation crashes, schema gaps, state leaks, provider-integration breaks, security lapses, timeouts, and unenforced configuration. A suite can look green without ever having probed half of them.

mcptest fault-coverage answers one question: which runtime fault families did this suite exercise? It loads a versioned fault taxonomy, detects the probes your suite uses, and marks each family covered, partially covered, uncovered, or not applicable, with an optional coverage threshold you can gate CI on.

This complements the other quality lanes. compliance and conformance check spec adherence; security checks adversarial inputs; fault-coverage reports, across all of them, which fault families your run touched.

The eight fault families

The taxonomy is grounded in "A Taxonomy of Runtime Faults in Model Context Protocol Servers". Each family has a stable category id; each leaf fault carries a FAULT-<CATEGORY>-<NAME> id.

| Category | Family | Example leaf fault |
| --- | --- | --- |
| PROTO | Protocol interaction | request processed before initialize completes |
| TOOL | Tool invocation | tool crashes instead of returning an error result |
| SCHEMA | Schema enforcement | structuredContent violates the declared outputSchema |
| STATE | State management | state leaks across tests or sessions |
| PROVIDER | Model-provider integration | provider rejects the served tool schema |
| SECURITY | Security validation | tool-description injection followed |
| TIMEOUT | Timeout and cancellation | tool hangs without a clean timeout |
| CONFIG | Configuration not enforced | read-only hint not enforced |

PROVIDER faults apply only to suites that define an agents: block; a server-only suite reports that family as n/a rather than uncovered.

How coverage is computed

Each leaf fault is mapped to the mcptest probes that exercise it (a probe is a subcommand or a suite feature). A leaf is exercised when any of its probes ran. A family is then covered (every leaf was exercised), partial (some leaves were exercised), uncovered (the family applies but no leaf was exercised), or n/a (the family does not apply to this suite shape).
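The rule above can be sketched in a few lines of Python. This is an illustration of the rule as stated, not mcptest's implementation: the function names and the leaf ids in the usage note are hypothetical, and the real leaf-to-probe mapping lives in the registry.

```python
from typing import Dict, Set


def leaf_exercised(leaf_probes: Set[str], ran: Set[str]) -> bool:
    """A leaf is exercised when any of its probes ran."""
    return bool(leaf_probes & ran)


def family_status(leaves: Dict[str, Set[str]], ran: Set[str],
                  applies: bool = True) -> str:
    """Classify one family from its leaves (leaf id -> probes that exercise it)."""
    if not applies:
        return "n/a"  # the family does not apply to this suite shape
    hit = sum(leaf_exercised(probes, ran) for probes in leaves.values())
    if hit == 0:
        return "uncovered"
    return "covered" if hit == len(leaves) else "partial"
```

For example, with two hypothetical leaves `{"FAULT-TOOL-A": {"fuzz"}, "FAULT-TOOL-B": {"fuzz", "negative-path"}}`, a run that used only `negative-path` is partial, and a run that used `fuzz` is covered.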

The command detects these probes from the suite YAML automatically:

| Suite feature | Probe token |
| --- | --- |
| a compliance: block | compliance, conformance |
| a security: block or a trust_boundary: | security, trust-boundary |
| faults: / agent recovery: / inject: | fault-recovery |
| a negative_path: check | negative-path |
| an output_schema: | output-schema |
| run_options.restart_policy | test-isolation |
| agents: (with trajectory/golden_path/distractors/equal_function) | agent, tool-selection |

Probes you run out of band (a separate mcptest fuzz or mcptest schema-lint pass) are folded in with --probe.
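As a rough sketch of what detection plus --probe amounts to, the function below walks an already-parsed suite and collects probe tokens. It reads each table row literally and makes several assumptions that are not confirmed by mcptest: that a feature counts wherever its key appears in the suite, that either key in a row yields both of that row's tokens, and that `recovery` is the key meant by "agent recovery:". All names here are hypothetical.

```python
from typing import Any, Iterable, Iterator, Set

# Key -> probe tokens, read row by row from the table above (assumed reading).
KEY_TO_PROBES = {
    "compliance": {"compliance", "conformance"},
    "security": {"security", "trust-boundary"},
    "trust_boundary": {"security", "trust-boundary"},
    "faults": {"fault-recovery"},
    "recovery": {"fault-recovery"},
    "inject": {"fault-recovery"},
    "negative_path": {"negative-path"},
    "output_schema": {"output-schema"},
}
AGENT_FEATURES = {"trajectory", "golden_path", "distractors", "equal_function"}


def iter_keys(node: Any) -> Iterator[str]:
    """Yield every mapping key found anywhere in the parsed suite."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield str(key)
            yield from iter_keys(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_keys(item)


def detect_probes(suite: dict, extra: Iterable[str] = ()) -> Set[str]:
    """Return detected probe tokens plus any out-of-band ones (like --probe)."""
    keys = set(iter_keys(suite))
    probes = set(extra)
    for key, tokens in KEY_TO_PROBES.items():
        if key in keys:
            probes |= tokens
    run_options = suite.get("run_options")
    if isinstance(run_options, dict) and "restart_policy" in run_options:
        probes.add("test-isolation")
    if "agents" in keys and keys & AGENT_FEATURES:
        probes |= {"agent", "tool-selection"}
    return probes
```

Passing `extra=["fuzz"]` plays the role of `--probe fuzz`: the token is added to whatever the suite itself reveals.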

A worked example

# Which fault families does this suite exercise, plus a fuzz pass run separately?
mcptest fault-coverage examples/negative-path.yml --probe fuzz

Runtime fault family coverage

  partial   Protocol interaction         2/3 faults  [PROTO]
  covered   Tool invocation              2/2 faults  [TOOL]
  partial   Schema enforcement           2/4 faults  [SCHEMA]
  uncovered State management             0/2 faults  [STATE]
  n/a       Model-provider integration   0/2 faults  [PROVIDER]
  uncovered Security validation          0/2 faults  [SECURITY]
  uncovered Timeout and cancellation     0/2 faults  [TIMEOUT]
  uncovered Configuration not enforced   0/1 faults  [CONFIG]

  38% of applicable faults exercised; 1 of 7 families fully covered

Gate CI on a coverage floor with --threshold; the command exits with code 6 when coverage falls below it. Use --format json for a machine-readable report, and --no-suite to print the bare taxonomy (combine with --probe to score an out-of-band pipeline).
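The summary line in the report can be reproduced by hand: n/a families drop out of both totals, leaving 6 exercised faults out of 16 applicable ones. The sketch below redoes that arithmetic from the per-family counts shown in the report. The round-half-up step, the assumption that --threshold is expressed in percent, and the helper names are guesses for illustration, not documented mcptest behavior.

```python
# (exercised, total, applies) per family, copied from the report above.
REPORT = {
    "PROTO": (2, 3, True),
    "TOOL": (2, 2, True),
    "SCHEMA": (2, 4, True),
    "STATE": (0, 2, True),
    "PROVIDER": (0, 2, False),  # n/a: excluded from every total
    "SECURITY": (0, 2, True),
    "TIMEOUT": (0, 2, True),
    "CONFIG": (0, 1, True),
}


def summarize(report):
    """Return (percent exercised, fully covered families, applicable families)."""
    rows = [(hit, total) for hit, total, applies in report.values() if applies]
    exercised = sum(hit for hit, _ in rows)
    faults = sum(total for _, total in rows)
    percent = int(100 * exercised / faults + 0.5)  # assumed round-half-up
    fully_covered = sum(1 for hit, total in rows if hit == total)
    return percent, fully_covered, len(rows)


def exit_code(percent, threshold):
    """Exit 6 below the floor, as the docs describe; 0 otherwise (assumed)."""
    return 6 if percent < threshold else 0


percent, fully_covered, families = summarize(REPORT)
print(f"{percent}% of applicable faults exercised; "
      f"{fully_covered} of {families} families fully covered")
```

Under these assumptions, a floor of 50 would fail this run with exit code 6, while a floor of 30 would pass.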

Fixtures

Every family ships a runnable fixture under examples/, so you can see each one exercised in isolation: examples/negative-path.yml (PROTO/SCHEMA), examples/fuzz/ (TOOL), examples/output-schema-conformance.yml (SCHEMA), examples/test-isolation.yml (STATE), examples/agent-weather.yml and examples/distractor-tools/ (PROVIDER), examples/security/ and examples/trust-boundary/ (SECURITY), examples/fault-injection-recovery.yml (TIMEOUT), and examples/config-enforcement/ (CONFIG).

Extending the taxonomy

The taxonomy lives in crates/mcptest-core/src/runtime_faults/registry.yml and is embedded in the binary; pass --registry <path> to load a customized copy instead. To add a fault: pick a category, give the leaf a FAULT-<CATEGORY>-<NAME> id, list the probes that exercise it, point fixture at an example, and run cargo test -p mcptest-core runtime_faults to confirm it loads.

Execution safety policy

Some mcptest features execute real tool calls with synthesized arguments: suite scaffolding, assertion proposal, and the probe tier. Against a tool like delete_file or a production SaaS backend, that is an agent autonomously causing side effects. The execution safety policy in mcptest-core::exec_policy is the single layer those features consult before calling anything.

Tool classification

Every tool from tools/list is classified before any call is planned. Explicit MCP tool annotations (the annotations object on the tool descriptor) always win over the name heuristic.

| Source | Condition | Class |
| --- | --- | --- |
| Annotation | readOnlyHint: true | ReadOnly |
| Annotation | destructiveHint: true | Destructive |
| Annotation | idempotentHint: false | Mutating |
| Name heuristic | destructive-looking word | Destructive |
| Name heuristic | mutating-looking word | Mutating |
| Name heuristic | anything else | ReadOnlyPresumed |

ReadOnly and ReadOnlyPresumed are kept distinct so callers can tell "the server declared this read-only" apart from "we presume it is". Among annotations, read-only is checked first (the spec defines the other hints as meaningful only when it is false), then destructive, then non-idempotent. A malformed annotations object (for example "destructiveHint": "yes") is ignored and the name heuristic decides; the description lints flag the malformed object separately.
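The precedence among annotations can be sketched as below. This is a Python illustration of the order described above, not the code in mcptest-core::exec_policy. The function name is hypothetical, and two details are assumptions: any non-boolean hint makes the whole object count as malformed, and a well-formed object with no decisive hint also defers to the name heuristic.

```python
from typing import Any, Optional

HINTS = ("readOnlyHint", "destructiveHint", "idempotentHint")


def classify_annotations(annotations: Any) -> Optional[str]:
    """Return a class from annotations, or None to defer to the name heuristic."""
    if not isinstance(annotations, dict):
        return None
    # Assumed: one non-boolean hint means the whole object is ignored.
    if any(k in annotations and not isinstance(annotations[k], bool)
           for k in HINTS):
        return None
    if annotations.get("readOnlyHint") is True:      # checked first
        return "ReadOnly"
    if annotations.get("destructiveHint") is True:   # then destructive
        return "Destructive"
    if annotations.get("idempotentHint") is False:   # then non-idempotent
        return "Mutating"
    return None  # assumed: nothing decisive, so the name heuristic decides
```

A tool annotated with both `readOnlyHint: true` and `destructiveHint: true` therefore comes out ReadOnly, and `{"destructiveHint": "yes"}` yields no class at all.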

The name heuristic

Unannotated tool names are split into lowercase words at separators and camelCase boundaries (deleteFile, delete-file, and delete_file all contain the word delete), then matched against two word lists:

A destructive word outranks a mutating word in the same name (create_or_delete is Destructive).
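A minimal sketch of the word split and the ranking follows. The two sets below hold only the words this page itself implies (delete as destructive, create as mutating); they are stand-ins, not mcptest's actual word lists, and the function names are hypothetical.

```python
import re

# Stand-in lists: only the words implied by the examples on this page.
DESTRUCTIVE_WORDS = {"delete"}
MUTATING_WORDS = {"create"}


def split_words(name: str) -> list:
    """Split a tool name at separators and camelCase boundaries, lowercased."""
    # Insert a space wherever a lowercase letter or digit meets an uppercase one.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", spaced) if w]


def classify_name(name: str) -> str:
    """Whole-word match; a destructive word outranks a mutating one."""
    words = set(split_words(name))
    if words & DESTRUCTIVE_WORDS:
        return "Destructive"
    if words & MUTATING_WORDS:
        return "Mutating"
    return "ReadOnlyPresumed"
```

Because matching is on whole words, `deleteFile`, `delete-file`, and `delete_file` all classify the same way, while a name that merely contains the letters (for example `undeleted_items`) does not match in this sketch.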

What each class means at execution time

ExecutionPolicy::decide maps a class to one of four decisions:

Setting execute_destructive (a CLI flag) downgrades Destructive to ExecuteOnce. Even with the override, destructive tools are never double-called.

Policy knobs and defaults

| Knob | Default | Meaning |
| --- | --- | --- |
| execute_destructive | false | Allow executing destructive tools. |
| max_calls | unlimited | Total tool-call budget for the run. |
| concurrency | 2 | Maximum calls in flight at once. |
| call_delay | 100ms | Polite pause between HTTP calls. |

The call budget is a thread-safe counter (CallBudget). Every planned call must acquire from it first; under concurrency exactly max_calls acquisitions succeed and the rest fail with a typed BudgetExhausted error carrying the limit, so a run can stop scheduling cleanly. The delay applies between consecutive HTTP calls; callers decide whether the target transport is HTTP. Stdio targets may ignore it.
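The budget's guarantee (exactly max_calls successful acquisitions, however many threads compete) can be illustrated with a lock-protected counter. This Python sketch borrows the names CallBudget and BudgetExhausted from the text, but the method names and structure are assumptions and do not mirror the real mcptest-core type.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


class BudgetExhausted(Exception):
    """Raised when the budget is spent; carries the configured limit."""

    def __init__(self, limit: int):
        super().__init__(f"call budget of {limit} exhausted")
        self.limit = limit


class CallBudget:
    def __init__(self, max_calls: Optional[int] = None):
        self._max = max_calls  # None means unlimited, the documented default
        self._used = 0
        self._lock = threading.Lock()

    def acquire(self) -> None:
        # Check and increment under one lock so no two threads can both
        # take the last remaining slot.
        with self._lock:
            if self._max is not None and self._used >= self._max:
                raise BudgetExhausted(self._max)
            self._used += 1


def try_acquire(budget: CallBudget) -> bool:
    try:
        budget.acquire()
        return True
    except BudgetExhausted:
        return False


# 50 planned calls compete for a budget of 10 across 8 worker threads.
budget = CallBudget(max_calls=10)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda _: try_acquire(budget), range(50)))
succeeded = sum(results)
```

Here `succeeded` is 10 regardless of thread interleaving, and a scheduler can stop planning calls as soon as it sees the first BudgetExhausted.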

Example

The policy reads the tool descriptors a server returns from tools/list, so the way to steer it is the annotations object on each tool. These two tools classify in opposite directions:

mock_server:
  name: records
  tools:
    # readOnlyHint wins over the name heuristic, so this runs freely
    # (stability double-calls included): ReadOnly -> Execute.
    - name: search_records
      description: "Find records matching a query."
      annotations:
        readOnlyHint: true
      response:
        content:
          - type: text
            text: "0 records"
    # No annotation, and the name contains "delete", so the heuristic
    # classifies it Destructive: GenerateOnly when a feature can emit a
    # test instead of running it, Refuse when a live call is required.
    - name: delete_record
      description: "Delete a record by id."
      response:
        content:
          - type: text
            text: "deleted"

Serve it with mcptest mock --tools-from records.yaml and point a synthesizing feature (scaffolding, proposal, or the probe tier) at it: the read-only tool is exercised, the destructive one is held back behind the # review before first run marker or skipped with a typed reason.

mcptest never cleans up after a mutating or destructive test. If a generated or probed call creates a record, sends a message, or uploads a file, removing that data afterwards is the developer's responsibility. Run synthesized suites against disposable or staging targets, not production.

Concurrent-session correctness

mcptest concurrency opens several sessions against a server at once and checks it stays correct under concurrency. It is a correctness check, not a load test: mcptest does not measure throughput or latency (use k6, wrk, or vegeta for that). It checks the things a generic HTTP load tool cannot, and it works over stdio as well as HTTP.

# Four concurrent stdio sessions (the default).
mcptest concurrency "python my_server.py"

# Eight concurrent sessions against an HTTP server, as JSON, gated.
mcptest concurrency https://example.com/mcp --sessions 8 --json

The server argument is an http(s):// URL (streamable HTTP) or a stdio command line (anything else), the same shape mcptest conformance run --server uses.

It runs one baseline session, then N concurrent ones, each listing the catalog, and reports three failure families:

It exits non-zero when any session fails, unless --no-fail is set. The report is a correctness verdict, not a latency histogram. A load tester measures how fast a server is; this measures whether it stays correct when several clients talk to it at once: whether one session's state bleeds into another, whether interleaved requests corrupt a response, whether the server wedges. Those are correctness properties, and they are exactly what a throughput tool does not check.
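The shape of the check (one baseline, then N sessions at once, each compared against the baseline) can be sketched as follows. This is not how mcptest concurrency is implemented. The `list_catalog` callable is a hypothetical stand-in for "open a session and list the catalog", and the failure labels are illustrative rather than the command's own failure families.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def check_concurrent(list_catalog, sessions=4, timeout=5.0):
    """Return a list of (session number, reason) for every failed session."""
    baseline = list_catalog(0)  # one session on its own, before any load
    failures = []
    pool = ThreadPoolExecutor(max_workers=sessions)
    futures = {i: pool.submit(list_catalog, i)
               for i in range(1, sessions + 1)}
    for i, future in futures.items():
        try:
            # Waits up to `timeout` seconds per session, checked in order.
            catalog = future.result(timeout=timeout)
        except FutureTimeout:
            failures.append((i, "timed out"))  # the server may have wedged
        except Exception as exc:
            failures.append((i, f"error: {exc}"))
        else:
            if catalog != baseline:
                failures.append((i, "catalog differs from baseline"))
    pool.shutdown(wait=False)  # do not block on a session that never returns
    return failures
```

An empty result corresponds to a passing run; any entry would map to a non-zero exit unless --no-fail is set.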