mcptest docs GitHub

Scenario 10: grade a server against the spec

A passing test suite tells you your server does what you asked it to do. It does not tell you whether your server follows the MCP specification. Those are different questions. A tool can return exactly the text you expect while the initialize response omits a capabilities block, or while a tools/call result returns no content blocks at all. Clients in the wild break on exactly these shapes.

mcptest compliance run answers the spec question. It drives a fixed corpus of 69 rules across five sections (lifecycle, tools, resources, prompts, transport-auth) and reports a single letter grade from A+ to F, broken out per section. Rule failures are reported by stable IDs like PROTO-002, TOOL-005, SCHEMA-005, and HEADER-002, so you can look up exactly which clause your server tripped.

This scenario grades two endpoints on the hosted test server: the conformant one, which scores well, and a deliberately broken one, which fails three MUST rules and drops to a low grade. No local server and no credentials are needed.

The YAML

Save this as tests/spec-grade.yml:

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json

servers:
  conformant:
    url: "https://test.mcptest.sh/mcp"
  broken:
    url: "https://test.mcptest.sh/mcp?scenario=invalid"

compliance:
  - name: "initialize handshake"
    server: conformant
    check: "initialize"
  - name: "tools list shape"
    server: conformant
    check: "tools/list"

What is happening here:

Run it (clean)

Grade the conformant endpoint:

mcptest compliance run --from-suite tests/spec-grade.yml \
  --server-label "test.mcptest.sh (conformant)"

--from-suite points the runner at the suite, drives the live server, and grades the corpus against the responses it collected. --server-label is the name printed in the report header. The default output is color-tinted text with the grade on the first lines; add --format markdown, --format json, or --format html for a CI summary, a machine consumer, or a self-contained report.

Run it against the non-conformant endpoint

Point the same run at the broken server. The suite already declares it, so override which server the named checks target by labeling the run and swapping the suite's server: to broken, or keep a second suite. The simplest path is the global --server-url override, which repoints the whole suite at one URL:

mcptest compliance run --from-suite tests/spec-grade.yml \
  --server-url "https://test.mcptest.sh/mcp?scenario=invalid" \
  --server-label "test.mcptest.sh (scenario=invalid)"

This run exits 1 because spec-violating MUST rules fail without a baseline to suppress them.

Expected output

The clean run scores high. Every applicable MUST passes; one optional transport SHOULD (HEADER-002, the X-Server-Version response header) is missing, which holds it at A rather than A+:

mcptest compliance against test.mcptest.sh (conformant)
  Rubric version: v2025-06-18

  Lifecycle      [A+]   MUST 4/4, SHOULD 2/2, MAY 1/1
  Tools          [A+]   MUST 5/5, SHOULD 3/3, MAY 1/1
  Resources      [A]    MUST 3/3, SHOULD 2/2, MAY 0/1
  Prompts        [A+]   MUST 2/2, SHOULD 1/1, MAY 1/1
  Transport      [A]    MUST 4/4, SHOULD 3/4, MAY 1/1

Grade: A
Score: 96.4/100
Completeness: 92%

Currently A. To reach A+, implement these MAY rules: RES-006.

The broken run fails the lifecycle and tools MUSTs and drops to D. A single failed basic-conformance MUST is enough to cap the grade well below A, and the per-section table shows exactly where:

mcptest compliance against test.mcptest.sh (scenario=invalid)
  Rubric version: v2025-06-18

  Lifecycle      [D]    MUST 2/4, SHOULD 2/2, MAY 0/1
  Tools          [C]    MUST 4/5, SHOULD 3/3, MAY 0/1
  Resources      [A]    MUST 3/3, SHOULD 2/2, MAY 1/1
  Prompts        [A]    MUST 2/2, SHOULD 1/1, MAY 0/1
  Transport      [B]    MUST 4/4, SHOULD 2/4, MAY 0/1

Grade: D
Score: 61.2/100
Completeness: 64%

Failed MUST checks:
  PROTO-002
  PROTO-005
  TOOL-005

Currently D. Pass at least 50% of SHOULD rules to reach C.

Reading the failures:

Each ID maps to one clause of the corpus, so you can look up the exact rule, read its spec citation, and fix the response shape it flags.

Baseline the known failures

If your own server fails a rule you cannot fix this sprint, you do not have to leave CI red. Declare the known failures in a baseline file and gate only on new regressions:

mcptest compliance run --from-suite tests/spec-grade.yml \
  --baseline tests/compliance-baseline.yml

A failing rule listed in the baseline counts as an expected failure and exits 0. A rule that fails without being listed is a new regression and exits 1. A baselined rule that starts passing is a stale entry and also exits 1, so the file gets trimmed as you fix things. The full pattern, including the short and long entry forms, is in docs/compliance-baseline.md.

Troubleshooting

See also