Walkthrough: authoring a test suite with your coding agent

This walkthrough goes from an untested MCP server to a maintained, deterministic test suite using the agent-authoring surface: scaffold a starter suite from the server's catalog, replace generic checks with observed assertions, close coverage gaps, and keep the suite current as the server evolves. Steps show the front-door verb an agent calls alongside the CLI command a human runs wherever both exist; the two share one engine, so the results are identical.

Everything here runs offline against the built-in mock server, so you can follow along without touching a real backend.

Setup

Two commands give your coding agent the full loop:

mcptest mcp-server --install --enable-writes   # the front door (verbs)
mcptest skill --install                        # the packaged skill

--enable-writes unlocks the verbs that execute tool calls (run_tool_test, propose_assertions, trigger_run). Without it the front door is introspection-only.

For the demo, save this mock manifest as demo-server.yml:

mock_server:
  name: notes
  tools:
    - name: search_notes
      description: Find notes matching a query.
      input_schema:
        type: object
        required: [query]
        properties:
          query: { type: string, minLength: 1 }
          limit: { type: integer, minimum: 1, maximum: 50 }
      response:
        content:
          - type: text
            text: "2 notes match ${args.query}"
    - name: create_note
      description: Create a note.
      input_schema:
        type: object
        required: [title]
        properties:
          title: { type: string }
          priority: { type: string, enum: [low, normal, urgent] }
      response:
        content:
          - type: text
            text: "created NOTE-7: ${args.title}"
    - name: delete_note
      description: Delete a note permanently.
      input_schema:
        type: object
        required: [id]
        properties:
          id: { type: string }
      response:
        content:
          - type: text
            text: "deleted ${args.id}"
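
The response templates in this manifest carry ${args.NAME} placeholders. The sketch below shows one way such a placeholder could be filled from the call arguments. It is an illustration inferred from the sample output later in this walkthrough, not the mock server's actual code; render_response is a hypothetical helper, and leaving unknown keys untouched is an assumption.

```python
import re

# Hypothetical helper: fills ${args.NAME} placeholders in a mock response.
_PLACEHOLDER = re.compile(r"\$\{args\.([A-Za-z_][A-Za-z0-9_]*)\}")


def render_response(template: str, args: dict) -> str:
    """Replace each ${args.key} with the matching call argument."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        # Assumption: a placeholder with no matching argument is left as-is.
        return str(args[key]) if key in args else match.group(0)

    return _PLACEHOLDER.sub(substitute, template)


print(render_response("2 notes match ${args.query}", {"query": "alpha"}))
# prints: 2 notes match alpha
```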

Undeclared stdio commands are an exec primitive and require --enable-writes (see the gating rationale), so declare the server in your workspace mcptest.yml to let the read-only front door spawn it:

servers:
  notes:
    command: ["mcptest", "mock", "--tools-from", "demo-server.yml"]

Step 1: scaffold a starter suite

One verb call turns the server's catalog into a validated suite covering tools, resources, and prompts. The agent calls:

{ "name": "scaffold_suite",
  "arguments": { "command": ["mcptest", "mock", "--tools-from", "demo-server.yml"],
                 "include_edge": true, "include_violation": true } }

The CLI equivalent writes per-tool stub files instead (note: stub files ship a placeholder servers: block you point at your server, while the verb embeds the real target):

mcptest generate stubs --server-command "mcptest mock --tools-from demo-server.yml" --output tests/

What you get per tool: a happy-path call with schema-aware args (priority picks low from the enum, limit respects its bounds), a missing-required rejection, a wrong-type rejection, and an output-schema conformance test when the tool declares one. Three details worth noticing:

delete_note is classified destructive from its name (or its destructiveHint annotation when declared), so the scaffolded suite puts its tests under a # review before first run marker, and nothing executes during scaffolding.
The response is capped at 25 tools per page; on big catalogs, pass cursor to fetch the next page or all: true to bypass the cap.
The returned YAML passes validate_suite by construction, so the agent can go straight to running it.
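
To make "schema-aware args" concrete, here is a minimal sketch of how happy-path, missing-required, and wrong-type arguments could be derived from an input_schema. It is not mcptest's synthesis algorithm: sample_value and scaffold_cases are hypothetical names, and the choices (first enum value, the declared minimum, a placeholder string) are assumptions that happen to match the behaviour described above.

```python
def sample_value(prop: dict):
    """Pick one schema-valid value for a property (illustrative rules only)."""
    if "enum" in prop:
        return prop["enum"][0]  # e.g. priority -> "low"
    kind = prop.get("type")
    if kind == "integer":
        # Stay inside the declared bounds when there are any.
        return prop.get("minimum", prop.get("maximum", 0))
    if kind == "string":
        return "x" * max(1, prop.get("minLength", 0))
    raise ValueError(f"type not handled in this sketch: {kind!r}")


def scaffold_cases(tool: dict) -> dict:
    """Build argument sets for the three generic checks."""
    schema = tool["input_schema"]
    props = schema.get("properties", {})
    required = schema.get("required", [])
    happy = {name: sample_value(prop) for name, prop in props.items()}
    cases = {"happy_path": happy}
    if required:
        first = required[0]
        # Missing-required: drop one required argument.
        cases["missing_required"] = {k: v for k, v in happy.items() if k != first}
        # Wrong-type: swap in a value of a different JSON type.
        is_string = props.get(first, {}).get("type") == "string"
        cases["wrong_type"] = {**happy, first: 123 if is_string else "not-a-number"}
    return cases


search_notes = {
    "name": "search_notes",
    "input_schema": {
        "type": "object",
        "required": ["query"],
        "properties": {
            "query": {"type": "string", "minLength": 1},
            "limit": {"type": "integer", "minimum": 1, "maximum": 50},
        },
    },
}
print(scaffold_cases(search_notes)["happy_path"])
# prints: {'query': 'x', 'limit': 1}
```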

Add --probe (CLI) or "probe": true (verb) for the deterministic boundary tier: values exactly at minimum/maximum, one-past violations, empty-input probes, and extra-property rejections, capped at 12 per tool and byte-stable across runs.
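
The boundary tier can be pictured with a short sketch. The function below derives probes from a schema in sorted property order, so repeated runs serialize to the same bytes, and truncates at the cap of 12 stated above. The probe record shape, the __unexpected__ property name, and the exact ordering are assumptions for illustration, not mcptest's output format.

```python
import json

MAX_PROBES_PER_TOOL = 12  # cap stated in the walkthrough


def boundary_probes(input_schema: dict) -> list:
    """Derive deterministic boundary probes from a JSON Schema fragment."""
    probes = []
    props = input_schema.get("properties", {})
    for name in sorted(props):  # sorted so the order never varies between runs
        prop = props[name]
        if prop.get("type") == "integer":
            if "minimum" in prop:
                probes.append({"arg": name, "value": prop["minimum"], "expect": "accept"})
                probes.append({"arg": name, "value": prop["minimum"] - 1, "expect": "reject"})
            if "maximum" in prop:
                probes.append({"arg": name, "value": prop["maximum"], "expect": "accept"})
                probes.append({"arg": name, "value": prop["maximum"] + 1, "expect": "reject"})
        if prop.get("type") == "string" and prop.get("minLength", 0) > 0:
            # Empty-input probe for strings that declare a minimum length.
            probes.append({"arg": name, "value": "", "expect": "reject"})
    # Extra-property probe; the property name is hypothetical.
    probes.append({"arg": "__unexpected__", "value": True, "expect": "reject"})
    return probes[:MAX_PROBES_PER_TOOL]


schema = {
    "type": "object",
    "required": ["query"],
    "properties": {
        "query": {"type": "string", "minLength": 1},
        "limit": {"type": "integer", "minimum": 1, "maximum": 50},
    },
}
first_run = json.dumps(boundary_probes(schema), sort_keys=True)
second_run = json.dumps(boundary_probes(schema), sort_keys=True)
print(first_run == second_run)
# prints: True
```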

Step 2: replace generic checks with observed assertions

Scaffolded expectations are deliberately conservative. The accept loop upgrades them from ground truth:

mcptest propose --tool search_notes --args '{"query": "alpha"}' \
  --server-command "mcptest mock --tools-from demo-server.yml"

mcptest executes the call twice, diffs the two responses at leaf level, and proposes assertions only on what held stable:

  - name: "search_notes: proposed assertions"
    server: "target"
    tool: "search_notes"
    args:
      query: "alpha"
    expect:
      assertions:
        - target: "result.isError"
          matcher:
            not:
              exact: true
          message: "tool call must not signal an error"
        - target: "result.content"
          matcher:
            schema:
              items:
                properties:
                  text:
                    type: "string"
                  type:
                    type: "string"
                required: ["text", "type"]
                type: "object"
              type: "array"
          message: "observed structure of result.content"
        - target: "result.content[0].text"
          matcher:
            exact: "2 notes match alpha"
          message: "stable across both observed calls"
        - target: "result.content[0].type"
          matcher:
            exact: "text"
          message: "stable across both observed calls"
      # latency budget: 2x the slowest observed call, rounded up to the nearest 50 ms, floor 100 ms
      max_duration_ms: 100
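
The max_duration_ms value above follows the formula in its comment: twice the slowest observed call, rounded up to the nearest 50 ms, with a floor of 100 ms. A minimal sketch of that arithmetic, with latency_budget_ms as a hypothetical name and the sample latencies invented for illustration:

```python
import math


def latency_budget_ms(observed_ms: list) -> int:
    """2x the slowest observed call, rounded up to 50 ms steps, floor 100 ms."""
    doubled = 2 * max(observed_ms)
    rounded = math.ceil(doubled / 50) * 50
    return max(100, rounded)


print(latency_budget_ms([3, 4]))     # fast mock calls hit the floor
# prints: 100
print(latency_budget_ms([120, 90]))  # 240 ms rounds up to the next 50 ms step
# prints: 250
```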

The latency budget is a real max_duration_ms field on the long-form expect: block, so the engine enforces it on every run; only the derivation formula travels as a comment. Volatile leaves (timestamps, counters) are excluded and listed, so the proposal still passes tomorrow. The safety policy applies here too: mutating tools get exactly one call (structural assertions only, never a stability double-call), and destructive tools refuse without --execute-destructive. The agent-side verb is propose_assertions with the same semantics; paste the returned block into the suite.
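
The leaf-level diff can be sketched as follows: flatten both responses into path-to-leaf maps, keep the leaves that are identical in both, and list everything else as volatile. This is an illustration of the idea rather than mcptest's implementation; flatten and split_stable are hypothetical names, and the fetched_at field is invented to show a volatile leaf.

```python
def flatten(value, path="result"):
    """Yield (path, leaf) pairs such as ('result.content[0].text', '...')."""
    if isinstance(value, dict):
        for key in sorted(value):
            yield from flatten(value[key], f"{path}.{key}")
    elif isinstance(value, list):
        for index, item in enumerate(value):
            yield from flatten(item, f"{path}[{index}]")
    else:
        yield path, value


def split_stable(first: dict, second: dict):
    """Return (stable leaves, volatile paths) for two observed responses."""
    a, b = dict(flatten(first)), dict(flatten(second))
    stable = {
        path: leaf
        for path, leaf in a.items()
        # Same type and same value; the type check keeps True and 1 apart.
        if path in b and type(b[path]) is type(leaf) and b[path] == leaf
    }
    volatile = sorted((set(a) | set(b)) - set(stable))
    return stable, volatile


call_1 = {
    "content": [{"type": "text", "text": "2 notes match alpha"}],
    "fetched_at": "2024-01-01T00:00:00Z",  # hypothetical volatile field
}
call_2 = {
    "content": [{"type": "text", "text": "2 notes match alpha"}],
    "fetched_at": "2024-01-01T00:00:01Z",
}
stable, volatile = split_stable(call_1, call_2)
print(volatile)
# prints: ['result.fetched_at']
```

Each stable leaf maps naturally onto an exact matcher at its path, which is the shape of the proposed assertions shown in this step.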

Step 3: validate, run, read the failure

mcptest validate --config suite.yml --format json
mcptest run suite.yml --reporter agent

Validation errors come back as {path, message, hint} triples with did-you-mean suggestions (serverz: suggests servers, containz: suggests contains), so one retry fixes a typo. A failing run reads like this:

VERDICT fail 3/4 passed (1 failed, 0 inconclusive, 0 cached, 41ms)
FAIL create_note: valid arguments
  assert: assertion #0 (`result.content[0].text`) failed: substring `NOTE-` not found
  actual: created note 7: hello
  full: mcptest://runs/01JC.../tests/create_note-valid-arguments/output
  repro: mcptest run suite.yml --filter "create_note: valid arguments"

Every line after the verdict is actionable: repro executes verbatim, and full is a resource URI that returns the complete redacted output when the preview was clipped. Agents batch through run_tool_test: an N-test inline suite runs in one engine invocation and returns one verdict per test.
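
For a script that consumes this output, the sketch below parses the verdict line and the key/value lines under each FAIL entry. The grammar is inferred only from the sample above, so treat it as an assumption about the agent reporter's format rather than a specification; parse_report is a hypothetical helper.

```python
import re

VERDICT = re.compile(
    r"^VERDICT (?P<status>\w+) (?P<passed>\d+)/(?P<total>\d+) passed "
    r"\((?P<failed>\d+) failed, (?P<inconclusive>\d+) inconclusive, "
    r"(?P<cached>\d+) cached, (?P<ms>\d+)ms\)$"
)


def parse_report(text: str) -> dict:
    """Parse agent-reporter text into a dict (format inferred from one sample)."""
    lines = text.strip().splitlines()
    match = VERDICT.match(lines[0].strip())
    if match is None:
        raise ValueError("first line is not a VERDICT line")
    report = {
        "status": match["status"],
        "passed": int(match["passed"]),
        "total": int(match["total"]),
        "failures": [],
    }
    current = None
    for line in lines[1:]:
        if line.startswith("FAIL "):
            current = {"test": line[len("FAIL "):]}
            report["failures"].append(current)
        elif current is not None and ": " in line:
            # Split on the first ': ' only, so values may contain colons.
            key, _, value = line.strip().partition(": ")
            current[key] = value
    return report


SAMPLE = """
VERDICT fail 3/4 passed (1 failed, 0 inconclusive, 0 cached, 41ms)
FAIL create_note: valid arguments
  assert: assertion #0 (`result.content[0].text`) failed: substring `NOTE-` not found
  actual: created note 7: hello
  repro: mcptest run suite.yml --filter "create_note: valid arguments"
"""
report = parse_report(SAMPLE)
print(report["failures"][0]["repro"])
```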

If you cannot recall a matcher, mcptest matchers --json prints the full catalog with a copy-paste example each.
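
The did-you-mean hints on validation errors mentioned earlier in this step can be approximated with Python's standard difflib. This is a sketch of the idea, not mcptest's validator: unknown_key_error is a hypothetical helper, the 0.8 similarity cutoff is an assumption, and the allowed-key lists contain only names that appear in this walkthrough plus a hypothetical tests key.

```python
import difflib


def unknown_key_error(path: str, key: str, allowed: list) -> dict:
    """Build a {path, message, hint} triple for an unrecognized key."""
    matches = difflib.get_close_matches(key, allowed, n=1, cutoff=0.8)
    if matches:
        hint = f"did you mean `{matches[0]}`?"
    else:
        hint = "allowed keys: " + ", ".join(sorted(allowed))
    return {"path": path, "message": f"unknown key `{key}`", "hint": hint}


TOP_LEVEL_KEYS = ["servers", "tests"]  # "tests" is a hypothetical key
MATCHERS = ["exact", "contains", "schema", "not"]

print(unknown_key_error("serverz", "serverz", TOP_LEVEL_KEYS)["hint"])
# prints: did you mean `servers`?
```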

Step 4: iterate without ceremony

Park a flaky test with a reason, focus on the one you are fixing, and preview the plan without executing:

- name: "create_note: valid arguments"
  only: true              # run just this while iterating
- name: "delete_note: rejects unknown id"
  skip: "blocked on upstream fixture bug"

mcptest run suite.yml --explain            # what would run, assertion by assertion
mcptest run suite.yml --filter "re:^create_"   # anchored regex selection
mcptest run suite.yml --watch              # re-run on save

Setting only: prints a loud warning and is refused under CI=true, so a focused suite can never turn CI green by accident.
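
The interaction between only:, skip:, --filter, and the CI guard can be sketched as a single selection function. This illustrates the rules described in this step and is not mcptest's planner: select_tests is a hypothetical name, and treating a filter without the re: prefix as a substring match is an assumption.

```python
import re


def select_tests(tests, name_filter=None, env=None):
    """Return (name, action) pairs describing what a run would do."""
    env = env or {}
    focused = [t for t in tests if t.get("only")]
    if focused and env.get("CI") == "true":
        raise RuntimeError("only: is refused under CI=true")
    plan = []
    for test in focused or tests:  # focused tests replace the full list
        name = test["name"]
        if name_filter is not None:
            if name_filter.startswith("re:"):
                if not re.search(name_filter[len("re:"):], name):
                    continue
            elif name_filter not in name:  # assumed substring semantics
                continue
        if test.get("skip"):
            plan.append((name, "skip: " + test["skip"]))
        else:
            plan.append((name, "run"))
    return plan


suite = [
    {"name": "create_note: valid arguments"},
    {"name": "delete_note: rejects unknown id", "skip": "blocked on upstream fixture bug"},
]
print(select_tests(suite, name_filter="re:^create_"))
# prints: [('create_note: valid arguments', 'run')]
```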

Step 5: close the coverage gap

mcptest coverage suite.yml --tools-from demo-server.yml --suggest

The report names every uncovered tool, argument, and error path, then prints ready-to-merge drafts for exactly those, marked # suggested by mcptest coverage and named to never collide with your hand-written tests. Merge, re-run, and tool coverage reads 100 percent. Agents get the same drafts from get_coverage with "suggest": true.
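
At the tool level, the coverage figure reduces to a set difference between the catalog and the tools your tests call. The sketch below shows that calculation and one way to name drafts so they cannot collide with existing tests. Argument and error-path coverage are omitted, and both function names and the draft naming scheme are assumptions for illustration.

```python
def tool_coverage(catalog_tools: list, suite_tests: list):
    """Return (percent covered, uncovered tool names in catalog order)."""
    called = {test["tool"] for test in suite_tests if "tool" in test}
    uncovered = [name for name in catalog_tools if name not in called]
    if not catalog_tools:
        return 100.0, uncovered
    covered = len(catalog_tools) - len(uncovered)
    return 100.0 * covered / len(catalog_tools), uncovered


def draft_name(tool: str, existing_names: set) -> str:
    """Pick a draft test name that is not already used in the suite."""
    base = f"{tool}: suggested happy path"  # hypothetical naming scheme
    name, suffix = base, 2
    while name in existing_names:
        name = f"{base} ({suffix})"
        suffix += 1
    return name


catalog = ["search_notes", "create_note", "delete_note"]
tests = [
    {"name": "search_notes: proposed assertions", "tool": "search_notes"},
    {"name": "create_note: valid arguments", "tool": "create_note"},
]
percent, uncovered = tool_coverage(catalog, tests)
print(uncovered)
# prints: ['delete_note']
```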

Step 6: keep the suite alive

Three maintenance loops keep the suite current. All are additive, and none clobbers hand edits:

The server changed. Diff the catalog and regenerate exactly the affected tests:

mcptest diff baseline.json current.json --suggest-regen --suite-file suite.yml

Changed tools get drafts regenerated from the current schema plus the names of your tests that call them; removed tools get a delete-or-rewrite comment; added tools get fresh drafts. See examples/diff-regen/ for an offline walkthrough.
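
The three outcomes (changed, removed, added) can be sketched as a diff over two catalogs. The layout of baseline.json and current.json is not described here, so the sketch assumes each has already been reduced to a mapping from tool name to input schema; diff_catalogs and regen_plan are hypothetical names.

```python
def diff_catalogs(baseline: dict, current: dict) -> dict:
    """Compare two {tool name: input schema} mappings (assumed layout)."""
    shared = set(baseline) & set(current)
    return {
        "added": sorted(set(current) - set(baseline)),
        "removed": sorted(set(baseline) - set(current)),
        "changed": sorted(n for n in shared if baseline[n] != current[n]),
    }


def regen_plan(diff: dict, suite_tests: list) -> dict:
    """Map each affected tool to an action and the tests that call it."""

    def callers(tool: str) -> list:
        return [t["name"] for t in suite_tests if t.get("tool") == tool]

    plan = {}
    for tool in diff["changed"]:
        plan[tool] = {"action": "regenerate draft", "affected_tests": callers(tool)}
    for tool in diff["removed"]:
        plan[tool] = {"action": "delete or rewrite", "affected_tests": callers(tool)}
    for tool in diff["added"]:
        plan[tool] = {"action": "fresh draft", "affected_tests": []}
    return plan


baseline = {
    "search_notes": {"required": ["query"]},
    "delete_note": {"required": ["id"]},
}
current = {
    "search_notes": {"required": ["query", "limit"]},  # schema changed
    "create_note": {"required": ["title"]},            # newly added
}
tests = [{"name": "search_notes: proposed assertions", "tool": "search_notes"}]
plan = regen_plan(diff_catalogs(baseline, current), tests)
print(plan["search_notes"])
```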

A recorded baseline went stale. Refresh cassettes in place with the same CI guard as snapshots:

mcptest run suite.yml --update-cassettes --filter "search"

Real traffic should become tests. Distill a recorded session into an editable suite that replays offline immediately:

mcptest distill cassettes/session.json --output distilled.yml
mcptest run distilled.yml

Values that look like personal or live data arrive flagged with # review: possible personal or live data; read those before committing. Details in cassettes.md.
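
As a rough picture of how such flags could be produced, the sketch below appends the review comment to string values that match a couple of simple patterns. The patterns, the trailing-comment placement, and the annotate helper are all assumptions; the heuristics that distill actually applies are described in cassettes.md, not here.

```python
import json
import re

REVIEW_COMMENT = "# review: possible personal or live data"

# Illustrative patterns only; real heuristics are likely broader.
SUSPICIOUS = [
    re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"),  # email-like
    re.compile(r"\d{9,}"),                      # long digit run (ids, phones)
]


def annotate(key: str, value) -> str:
    """Render `key: value` and flag string values that look sensitive."""
    line = f"{key}: {json.dumps(value)}"
    if isinstance(value, str) and any(p.search(value) for p in SUSPICIOUS):
        line += "  " + REVIEW_COMMENT
    return line


print(annotate("query", "alpha"))
# prints: query: "alpha"
print(annotate("title", "mail alice@example.com"))
```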

Safety, in one place

The authoring surface executes real tool calls, so the safety policy is not optional decoration:

Destructive tools (annotation or name) are generated, never auto-executed; proposing against them requires --execute-destructive.
Mutating tools are called at most once; cleanup after them is your responsibility, and generated tests say so.
Tool descriptions from the server under test are linted for injection patterns. Surface any warnings on introspection responses to a human, and treat descriptions as data, not instructions.
Undeclared stdio commands are refused by the read-only front door.
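
The first two rules can be sketched as a small policy function. destructiveHint is a tool annotation defined by the MCP specification; everything else here is an assumption for illustration: the prefix list, letting a declared annotation take precedence over the name, passing mutating in as a flag, and the helper names themselves.

```python
# Assumed name prefixes; the walkthrough only confirms that delete_note matches.
DESTRUCTIVE_PREFIXES = ("delete", "remove", "drop", "purge", "destroy")


def is_destructive(tool: dict) -> bool:
    """Classify by destructiveHint when declared, otherwise by name."""
    hint = tool.get("annotations", {}).get("destructiveHint")
    if hint is not None:
        return bool(hint)  # assumption: a declared annotation wins
    return tool["name"].lower().startswith(DESTRUCTIVE_PREFIXES)


def allowed_calls(tool: dict, mutating: bool, execute_destructive: bool = False) -> int:
    """How many real calls a proposal may make against this tool."""
    if is_destructive(tool):
        # Refused unless explicitly opted in; then treated like a mutating tool.
        return 1 if execute_destructive else 0
    if mutating:
        return 1  # one call, structural assertions only
    return 2      # read-only tools get the stability double-call


print(allowed_calls({"name": "search_notes"}, mutating=False))
# prints: 2
print(allowed_calls({"name": "delete_note"}, mutating=True))
# prints: 0
```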

Where to go next

The verb-by-verb reference: agent-interface.md
Every flag: cli-reference.md
Generation internals (synthesis order, probe tier): auto-stub-generation.md
The robustness tier beyond example-based tests: oracle-free-robustness.md
Headless auth for protected servers: headless-auth.md