Walkthrough: authoring a test suite with your coding agent
This walkthrough goes from an untested MCP server to a maintained, deterministic test suite using the agent-authoring surface: scaffold a starter suite from the server's catalog, replace generic checks with observed assertions, close coverage gaps, and keep the suite current as the server evolves. Every step shows both the front-door verb an agent calls and the CLI command a human runs; they share one engine, so the results are identical.
Everything here runs offline against the built-in mock server, so you can follow along without touching a real backend.
Setup
Two commands give your coding agent the full loop:
mcptest mcp-server --install --enable-writes # the front door (verbs)
mcptest skill --install # the packaged skill
--enable-writes unlocks the verbs that execute tool calls (run_tool_test, propose_assertions, trigger_run). Without it the front door is introspection-only.
For the demo, save this mock manifest as demo-server.yml:
mock_server:
  name: notes
  tools:
    - name: search_notes
      description: Find notes matching a query.
      input_schema:
        type: object
        required: [query]
        properties:
          query: { type: string, minLength: 1 }
          limit: { type: integer, minimum: 1, maximum: 50 }
      response:
        content:
          - type: text
            text: "2 notes match ${args.query}"
    - name: create_note
      description: Create a note.
      input_schema:
        type: object
        required: [title]
        properties:
          title: { type: string }
          priority: { type: string, enum: [low, normal, urgent] }
      response:
        content:
          - type: text
            text: "created NOTE-7: ${args.title}"
    - name: delete_note
      description: Delete a note permanently.
      input_schema:
        type: object
        required: [id]
        properties:
          id: { type: string }
      response:
        content:
          - type: text
            text: "deleted ${args.id}"
Declare the server in your workspace mcptest.yml so the read-only front door may spawn it (undeclared stdio commands are an exec primitive and require --enable-writes; see the gating rationale):
servers:
  notes:
    command: ["mcptest", "mock", "--tools-from", "demo-server.yml"]
Step 1: scaffold a starter suite
One verb call turns the server's catalog into a validated suite covering tools, resources, and prompts. The agent calls:
{ "name": "scaffold_suite",
"arguments": { "command": ["mcptest", "mock", "--tools-from", "demo-server.yml"],
"include_edge": true, "include_violation": true } }
The CLI equivalent writes per-tool stub files instead (note: stub files ship a placeholder servers: block you point at your server, while the verb embeds the real target):
mcptest generate stubs --server-command "mcptest mock --tools-from demo-server.yml" --output tests/
What you get per tool: a happy-path call with schema-aware args (priority picks low from the enum, limit respects its bounds), a missing-required rejection, a wrong-type rejection, and an output-schema conformance test when the tool declares one. Three details worth noticing:
- delete_note is classified destructive from its name (or its destructiveHint annotation when declared), so the scaffolded suite puts its tests under a # review before first run marker, and nothing executes during scaffolding.
- The response is capped at 25 tools per page; pass cursor to continue or all: true to bypass on big catalogs.
- The returned YAML passes validate_suite by construction, so the agent can go straight to running it.
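To make "schema-aware args" concrete, here is a minimal Python sketch that derives one valid argument object from a tool's input schema. It is not mcptest's generator: the function names sample_value and happy_path_args and the fallback rules (first enum member, lowest allowed integer, shortest allowed string) are assumptions chosen for illustration.

```python
from typing import Any


def sample_value(prop: dict[str, Any]) -> Any:
    """Pick one value that satisfies a single property schema (assumed rules)."""
    if "enum" in prop:
        return prop["enum"][0]  # first member, e.g. "low" for priority
    kind = prop.get("type")
    if kind == "integer":
        if "minimum" in prop:
            return prop["minimum"]  # lowest allowed value stays in bounds
        return min(1, prop["maximum"]) if "maximum" in prop else 1
    if kind == "string":
        return "x" * max(1, prop.get("minLength", 1))
    if kind == "boolean":
        return True
    return None  # arrays, objects, and numbers are left out of this sketch


def happy_path_args(schema: dict[str, Any]) -> dict[str, Any]:
    """Build one argument object covering every declared property."""
    return {
        name: sample_value(prop)
        for name, prop in schema.get("properties", {}).items()
    }


# Input schema of search_notes from demo-server.yml, written as a dict.
SEARCH_NOTES_SCHEMA = {
    "type": "object",
    "required": ["query"],
    "properties": {
        "query": {"type": "string", "minLength": 1},
        "limit": {"type": "integer", "minimum": 1, "maximum": 50},
    },
}

print(happy_path_args(SEARCH_NOTES_SCHEMA))  # {'query': 'x', 'limit': 1}
```

A missing-required test can be derived from the same object by deleting one key named in required, and a wrong-type test by swapping one value for a value of another type.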
Add --probe (CLI) or "probe": true (verb) for the deterministic boundary tier: values exactly at minimum/maximum, one-past violations, empty-input probes, and extra-property rejections, capped at 12 per tool and byte-stable across runs.
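The boundary tier can be pictured with a short Python sketch. This is an assumption-laden illustration, not the tool's implementation: the function name boundary_probes, the accept/reject labels, and the unexpected_property key are hypothetical. It shows the two properties the text describes: probes sit exactly at and one past each bound, and sorted iteration plus a fixed cap keep the output identical across runs.

```python
from typing import Any

Probe = tuple[str, dict[str, Any], str]  # (label, args, expected outcome)


def boundary_probes(schema: dict[str, Any], base: dict[str, Any],
                    cap: int = 12) -> list[Probe]:
    """Derive deterministic boundary probes on top of a valid base call."""
    probes: list[Probe] = []
    props = schema.get("properties", {})
    for name in sorted(props):  # sorted keys keep the order stable
        prop = props[name]
        if "minimum" in prop:
            low = prop["minimum"]
            probes.append((f"{name} at minimum", {**base, name: low}, "accept"))
            probes.append((f"{name} below minimum", {**base, name: low - 1}, "reject"))
        if "maximum" in prop:
            high = prop["maximum"]
            probes.append((f"{name} at maximum", {**base, name: high}, "accept"))
            probes.append((f"{name} above maximum", {**base, name: high + 1}, "reject"))
        if prop.get("type") == "string":
            # An empty string violates minLength >= 1 and any enum constraint.
            rejects_empty = prop.get("minLength", 0) > 0 or "enum" in prop
            outcome = "reject" if rejects_empty else "accept"
            probes.append((f"{name} empty", {**base, name: ""}, outcome))
    # Hypothetical key name; stands in for the extra-property rejection probe.
    probes.append(("extra property", {**base, "unexpected_property": 1}, "reject"))
    return probes[:cap]  # hard cap per tool


schema = {
    "type": "object",
    "required": ["query"],
    "properties": {
        "query": {"type": "string", "minLength": 1},
        "limit": {"type": "integer", "minimum": 1, "maximum": 50},
    },
}
for label, args, expected in boundary_probes(schema, {"query": "alpha"}):
    print(f"{expected:6} {label}: {args}")
```

Because nothing in the sketch depends on randomness or dictionary insertion order, calling it twice on the same schema returns equal lists.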
Step 2: replace generic checks with observed assertions
Scaffolded expectations are deliberately conservative. The accept loop upgrades them from ground truth:
mcptest propose --tool search_notes --args '{"query": "alpha"}' \
--server-command "mcptest mock --tools-from demo-server.yml"
mcptest executes the call twice, diffs the two responses at leaf level, and proposes assertions only on what held stable:
- name: "search_notes: proposed assertions"
  server: "target"
  tool: "search_notes"
  args:
    query: "alpha"
  expect:
    assertions:
      - target: "result.isError"
        matcher:
          not:
            exact: true
        message: "tool call must not signal an error"
      - target: "result.content"
        matcher:
          schema:
            items:
              properties:
                text:
                  type: "string"
                type:
                  type: "string"
              required: ["text", "type"]
              type: "object"
            type: "array"
        message: "observed structure of result.content"
      - target: "result.content[0].text"
        matcher:
          exact: "2 notes match alpha"
        message: "stable across both observed calls"
      - target: "result.content[0].type"
        matcher:
          exact: "text"
        message: "stable across both observed calls"
    # latency budget: 2x the slowest observed call, rounded up to the nearest 50 ms, floor 100 ms
    max_duration_ms: 100
The latency budget is a real max_duration_ms field on the long-form expect: block, so the engine enforces it on every run; only the derivation formula travels as a comment. Volatile leaves (timestamps, counters) are excluded and listed, so the proposal still passes tomorrow. The safety policy applies here too: mutating tools get exactly one call (structural assertions only, never a stability double-call), and destructive tools refuse without --execute-destructive. The agent-side verb is propose_assertions with the same semantics; paste the returned block into the suite.
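The two mechanical parts of the accept loop, the leaf-level diff and the budget formula, are easy to sketch in Python. The helper names flatten, split_leaves, and latency_budget_ms are hypothetical, and the sketch assumes responses are plain JSON-like dicts and lists; only the budget rule itself (double the slowest call, round up to 50 ms, floor at 100 ms) comes from the comment in the proposal.

```python
import math
from typing import Any


def flatten(value: Any, path: str = "result") -> dict[str, Any]:
    """Map every leaf to a path such as result.content[0].text."""
    leaves: dict[str, Any] = {}
    if isinstance(value, dict):
        for key, child in value.items():
            leaves.update(flatten(child, f"{path}.{key}"))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            leaves.update(flatten(child, f"{path}[{index}]"))
    else:
        leaves[path] = value
    return leaves


def split_leaves(first: Any, second: Any) -> tuple[dict[str, Any], list[str]]:
    """Return (stable leaves with their value, sorted volatile leaf paths)."""
    a, b = flatten(first), flatten(second)
    stable = {p: v for p, v in a.items() if p in b and b[p] == v}
    volatile = sorted((set(a) | set(b)) - set(stable))
    return stable, volatile


def latency_budget_ms(durations_ms: list[float]) -> int:
    """2x the slowest observed call, rounded up to 50 ms, never below 100 ms."""
    doubled = 2 * max(durations_ms)
    rounded = math.ceil(doubled / 50) * 50
    return max(100, rounded)


# "served_at" is a made-up volatile field added to show the exclusion.
call_1 = {"isError": False, "served_at": 1001,
          "content": [{"type": "text", "text": "2 notes match alpha"}]}
call_2 = {"isError": False, "served_at": 1002,
          "content": [{"type": "text", "text": "2 notes match alpha"}]}

stable, volatile = split_leaves(call_1, call_2)
print(sorted(stable))               # paths safe to assert exactly
print(volatile)                     # ['result.served_at']
print(latency_budget_ms([12, 18]))  # 100
```

With observed calls of 12 ms and 18 ms, doubling gives 36 ms, rounding up gives 50 ms, and the floor lifts the budget to the 100 ms seen in the proposal above.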
Step 3: validate, run, read the failure
mcptest validate --config suite.yml --format json
mcptest run suite.yml --reporter agent
Validation errors come back as {path, message, hint} triples with did-you-mean suggestions (serverz: suggests servers, containz: suggests contains), so one retry fixes a typo. A failing run reads like this:
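A did-you-mean hint of this kind can be approximated with Python's standard difflib module. The sketch below is only an illustration of the idea: the function name unknown_key_error, the message wording, and the candidate key lists are assumptions, and nothing here claims to match how mcptest computes its suggestions.

```python
import difflib


def unknown_key_error(path: str, key: str, allowed: list[str]) -> dict[str, str]:
    """Build a {path, message, hint} triple for an unrecognized key."""
    # get_close_matches ranks candidates by similarity; keep the single best.
    matches = difflib.get_close_matches(key, allowed, n=1, cutoff=0.6)
    if matches:
        hint = f"did you mean `{matches[0]}`?"
    else:
        hint = "allowed keys: " + ", ".join(sorted(allowed))
    return {"path": path, "message": f"unknown key `{key}`", "hint": hint}


# Hypothetical candidate lists; only servers and contains appear in the text.
TOP_LEVEL_KEYS = ["servers", "tests", "defaults"]
MATCHER_KEYS = ["contains", "exact", "schema", "not"]

print(unknown_key_error("$", "serverz", TOP_LEVEL_KEYS))
print(unknown_key_error("$.tests[0].expect", "containz", MATCHER_KEYS))
```

When no candidate clears the similarity cutoff, the sketch falls back to listing the allowed keys, so the hint field is never empty.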
VERDICT fail 3/4 passed (1 failed, 0 inconclusive, 0 cached, 41ms)
FAIL create_note: valid arguments
assert: assertion #0 (`result.content[0].text`) failed: substring `NOTE-` not found
actual: created note 7: hello
full: mcptest://runs/01JC.../tests/create_note-valid-arguments/output
repro: mcptest run suite.yml --filter "create_note: valid arguments"
Every line is an action: repro executes verbatim, and full is a resource URI returning the complete redacted output when the preview was clipped. Agents batch through run_tool_test: an N-test inline suite runs in one engine invocation and returns one verdict per test.
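For wrappers that post-process the agent reporter's text, a small parser is enough. The Python sketch below assumes the line shapes shown in the sample output above (a VERDICT line, FAIL headers, then key: value detail lines); the real format may carry more fields, and parse_report is a hypothetical helper, not part of mcptest.

```python
import re
import shlex

VERDICT_RE = re.compile(r"^VERDICT (\w+) (\d+)/(\d+) passed")
DETAIL_KEYS = {"assert", "actual", "full", "repro"}


def parse_report(text: str) -> dict:
    """Extract the verdict and per-failure details from agent-reporter text."""
    summary = {"verdict": None, "passed": 0, "total": 0, "failures": []}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = VERDICT_RE.match(line)
        if match:
            summary["verdict"] = match.group(1)
            summary["passed"] = int(match.group(2))
            summary["total"] = int(match.group(3))
        elif line.startswith("FAIL "):
            current = {"test": line[len("FAIL "):]}
            summary["failures"].append(current)
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)  # split on the first separator only
            if key in DETAIL_KEYS:
                current[key] = value
    return summary


REPORT = """\
VERDICT fail 3/4 passed (1 failed, 0 inconclusive, 0 cached, 41ms)
FAIL create_note: valid arguments
  assert: assertion #0 (`result.content[0].text`) failed: substring `NOTE-` not found
  actual: created note 7: hello
  full: mcptest://runs/01JC.../tests/create_note-valid-arguments/output
  repro: mcptest run suite.yml --filter "create_note: valid arguments"
"""

report = parse_report(REPORT)
failure = report["failures"][0]
print(report["verdict"], failure["test"])
print(shlex.split(failure["repro"]))  # argv list ready for a subprocess call
```

Splitting the repro line with shlex keeps the quoted filter as one argument, which matters because the test name itself contains a colon and a space.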
If you cannot recall a matcher, mcptest matchers --json prints the full catalog with a copy-paste example each.
Step 4: iterate without ceremony
Park a flaky test with a reason, focus on the one you are fixing, and preview the plan without executing:
- name: "create_note: valid arguments"
  only: true # run just this while iterating
- name: "delete_note: rejects unknown id"
  skip: "blocked on upstream fixture bug"
mcptest run suite.yml --explain # what would run, assertion by assertion
mcptest run suite.yml --filter "re:^create_" # anchored regex selection
mcptest run suite.yml --watch # re-run on save
only: prints a loud warning and is refused under CI=true, so a focused suite can never gate CI green by accident.
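The selection rule behind only:, skip:, and the CI guard fits in a few lines of Python. This is a sketch under assumptions: select_tests is a hypothetical function, tests are plain dicts, and the exact warning and error text are invented. What it takes from the text is the behavior: skipped tests drop out, a focused test narrows the run, and focus is refused when CI=true.

```python
import os
import sys
from typing import Mapping, Optional


def select_tests(tests: list[dict],
                 env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the names of the tests that would run, honoring only: and skip:."""
    env = os.environ if env is None else env
    runnable = [t for t in tests if not t.get("skip")]
    focused = [t for t in runnable if t.get("only")]
    if not focused:
        return [t["name"] for t in runnable]
    if env.get("CI") == "true":
        # A focused suite must not be able to turn CI green by accident.
        raise RuntimeError("only: is refused when CI=true")
    print("WARNING: only: is set, other tests are not running", file=sys.stderr)
    return [t["name"] for t in focused]


suite = [
    {"name": "create_note: valid arguments", "only": True},
    {"name": "delete_note: rejects unknown id",
     "skip": "blocked on upstream fixture bug"},
    {"name": "search_notes: valid arguments"},
]
print(select_tests(suite, env={}))  # ['create_note: valid arguments']
```

Passing env explicitly keeps the sketch testable without mutating the real process environment.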
Step 5: close the coverage gap
mcptest coverage suite.yml --tools-from demo-server.yml --suggest
The report names every uncovered tool, argument, and error path, then prints ready-to-merge drafts for exactly those, marked # suggested by mcptest coverage and named to never collide with your hand-written tests. Merge, re-run, and tool coverage reads 100 percent. Agents get the same drafts from get_coverage with "suggest": true.
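Tool-level coverage reduces to a set difference between the catalog and the tools your tests call. The Python sketch below covers only that tool dimension (argument and error-path coverage are out of scope here), and tool_coverage is a hypothetical helper operating on plain dicts rather than on mcptest's own data model.

```python
def tool_coverage(catalog_tools: list[str], suite_tests: list[dict]) -> dict:
    """Report which catalog tools no test calls, plus a percentage."""
    catalog = set(catalog_tools)
    called = {t["tool"] for t in suite_tests if "tool" in t}
    uncovered = sorted(catalog - called)
    covered = len(catalog) - len(uncovered)
    # An empty catalog has nothing left to cover, so report it as complete.
    percent = 100.0 * covered / len(catalog) if catalog else 100.0
    return {"uncovered": uncovered, "percent": round(percent, 1)}


catalog = ["search_notes", "create_note", "delete_note"]
tests = [
    {"name": "search_notes: proposed assertions", "tool": "search_notes"},
    {"name": "create_note: valid arguments", "tool": "create_note"},
]
print(tool_coverage(catalog, tests))
# {'uncovered': ['delete_note'], 'percent': 66.7}
```

Adding any test whose tool is delete_note empties the uncovered list and brings the percentage to 100.0.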
Step 6: keep the suite alive
Three maintenance loops follow. All are additive, and none clobbers hand edits:
The server changed. Diff the catalog and regenerate exactly the affected tests:
mcptest diff baseline.json current.json --suggest-regen --suite-file suite.yml
Changed tools get drafts regenerated from the current schema plus the names of your tests that call them; removed tools get a delete-or-rewrite comment; added tools get fresh drafts. See examples/diff-regen/ for an offline walkthrough.
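The classification into added, removed, and changed tools can be sketched in Python as a comparison of two catalogs. The sketch assumes each catalog has already been reduced to a mapping from tool name to input schema; the actual layout of baseline.json and current.json is not shown in this walkthrough, and diff_catalogs and affected_tests are hypothetical names.

```python
def diff_catalogs(baseline: dict[str, dict], current: dict[str, dict]) -> dict:
    """Classify tools as added, removed, or changed between two catalogs."""
    shared = baseline.keys() & current.keys()
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "changed": sorted(n for n in shared if baseline[n] != current[n]),
    }


def affected_tests(suite_tests: list[dict], tools: list[str]) -> list[str]:
    """Names of existing tests that call any of the given tools."""
    wanted = set(tools)
    return [t["name"] for t in suite_tests if t.get("tool") in wanted]


baseline = {
    "search_notes": {"required": ["query"]},
    "delete_note": {"required": ["id"]},
}
current = {
    "search_notes": {"required": ["query", "limit"]},  # schema tightened
    "create_note": {"required": ["title"]},            # new tool
}
suite = [
    {"name": "search_notes: proposed assertions", "tool": "search_notes"},
    {"name": "delete_note: rejects unknown id", "tool": "delete_note"},
]

delta = diff_catalogs(baseline, current)
print(delta)
print(affected_tests(suite, delta["changed"]))  # tests to revisit
```

In this example search_notes is reported as changed, delete_note as removed, and create_note as added, which lines up with the three cases described above.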
A recorded baseline went stale. Refresh cassettes in place with the same CI guard as snapshots:
mcptest run suite.yml --update-cassettes --filter "search"
Real traffic should become tests. Distill a recorded session into an editable suite that replays offline immediately:
mcptest distill cassettes/session.json --output distilled.yml
mcptest run distilled.yml
Values that look like personal or live data arrive flagged with # review: possible personal or live data; read those before committing. Details in cassettes.md.
Safety, in one place
The authoring surface executes real tool calls, so the safety policy is not optional decoration:
- Destructive tools (annotation or name) are generated, never auto-executed; proposing against them requires --execute-destructive.
- Mutating tools are called at most once; cleanup after them is your responsibility, and generated tests say so.
- Tool descriptions from the server under test are linted for injection patterns; warnings on introspection responses should be surfaced to a human, and descriptions are data, not instructions.
- Undeclared stdio commands are refused by the read-only front door.
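The "annotation or name" rule for destructive tools can be illustrated with a conservative Python sketch. The word list below is hypothetical (the walkthrough does not say which names trigger the classification), and the sketch assumes that either signal alone is enough: a destructiveHint of true, or a destructive-looking word in the tool name.

```python
import re

# Hypothetical word list; the real heuristic is not documented here.
DESTRUCTIVE_WORDS = {"delete", "remove", "drop", "destroy", "purge", "wipe"}


def is_destructive(tool: dict) -> bool:
    """Treat a tool as destructive if its annotation or its name says so."""
    annotations = tool.get("annotations") or {}
    if annotations.get("destructiveHint") is True:
        return True
    words = re.split(r"[_\-\s]+", tool["name"].lower())
    return any(word in DESTRUCTIVE_WORDS for word in words)


tools = [
    {"name": "search_notes"},
    {"name": "create_note"},
    {"name": "delete_note"},
    {"name": "archive_all", "annotations": {"destructiveHint": True}},
]
for tool in tools:
    label = "review before first run" if is_destructive(tool) else "ok"
    print(f"{tool['name']}: {label}")
```

Erring toward "destructive" is the safer default for a generator: a false positive costs one manual review, while a false negative could execute an irreversible call.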
Where to go next
- The verb-by-verb reference: agent-interface.md
- Every flag: cli-reference.md
- Generation internals (synthesis order, probe tier): auto-stub-generation.md
- The robustness tier beyond example-based tests: oracle-free-robustness.md
- Headless auth for protected servers: headless-auth.md