Scenario 10: grade a server against the spec

A passing test suite tells you your server does what you asked it to do. It does not tell you whether your server follows the MCP specification. Those are different questions. A tool can return exactly the text you expect while the initialize response omits a capabilities block, or while a tools/call result returns no content blocks at all. Clients in the wild break on exactly these shapes.
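To make those two shapes concrete, here is a minimal Python sketch of the kind of structural check a spec rule performs. The function names and sample payloads are hypothetical illustrations, not mcptest's rule implementations; they only assume the JSON-RPC envelope with a result object.

```python
from typing import Any


def _result(response: dict[str, Any]) -> dict[str, Any]:
    """Return the JSON-RPC result object, or an empty dict if it is absent or malformed."""
    result = response.get("result")
    return result if isinstance(result, dict) else {}


def initialize_has_capabilities(response: dict[str, Any]) -> bool:
    """True when the initialize result carries a capabilities object."""
    return isinstance(_result(response).get("capabilities"), dict)


def tool_result_has_content(response: dict[str, Any]) -> bool:
    """True when the tools/call result holds a non-empty list of content blocks."""
    content = _result(response).get("content")
    return isinstance(content, list) and len(content) > 0


# Hypothetical payloads, trimmed to the fields the checks read.
good_init = {"jsonrpc": "2.0", "id": 1,
             "result": {"protocolVersion": "2025-06-18", "capabilities": {"tools": {}}}}
bad_init = {"jsonrpc": "2.0", "id": 1, "result": {"protocolVersion": "2025-06-18"}}
good_call = {"jsonrpc": "2.0", "id": 2,
             "result": {"content": [{"type": "text", "text": "ok"}]}}
bad_call = {"jsonrpc": "2.0", "id": 2, "result": {"content": []}}
```

A functional test that only asserts on the returned text would pass for good_call and never look at bad_init at all, which is the gap a spec grader closes.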

mcptest compliance run answers the spec question. It drives a fixed corpus of 69 rules across five sections (lifecycle, tools, resources, prompts, transport-auth) and reports a single letter grade from A+ to F, broken out per section. Rule failures are reported by stable IDs like PROTO-002, TOOL-005, SCHEMA-005, and HEADER-002, so you can look up exactly which clause your server tripped.

This scenario grades two endpoints on the hosted test server: the conformant one, which scores an A, and a deliberately broken one, which fails three MUST rules and drops to a D. No local server and no credentials are needed.

The YAML

Save this as tests/spec-grade.yml:

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json

servers:
  conformant:
    url: "https://test.mcptest.sh/mcp"
  broken:
    url: "https://test.mcptest.sh/mcp?scenario=invalid"

compliance:
  - name: "initialize handshake"
    server: conformant
    check: "initialize"
  - name: "tools list shape"
    server: conformant
    check: "tools/list"

What is happening here:

The conformant server points at https://test.mcptest.sh/mcp, the hosted test server's well-behaved endpoint. It returns a valid protocolVersion, a populated capabilities object, and tool results with real content blocks.
The broken server points at the same host with ?scenario=invalid. That endpoint returns spec-violating responses on purpose: a malformed protocolVersion, no capabilities block in the initialize response, and a tools/call result with an empty content field.
The compliance: block lists named checks. check: "initialize" drives the lifecycle handshake; check: "tools/list" drives tool discovery.
mcptest compliance run scores the full 69-rule corpus against the session, not just the two named checks. The named checks are how the runner exercises the server so the corpus has responses to grade.
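For orientation, the two named checks correspond to two standard MCP requests. The sketch below builds the JSON-RPC messages a client would send for them. The helper name and the clientInfo values are hypothetical, and whether mcptest sends exactly these bodies is an assumption; only the method names and the JSON-RPC envelope come from the protocol itself.

```python
import json
from typing import Any, Optional


def jsonrpc_request(request_id: int, method: str,
                    params: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Build a JSON-RPC 2.0 request; params is omitted when not supplied."""
    message: dict[str, Any] = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return message


# check: "initialize" -> the lifecycle handshake request.
initialize = jsonrpc_request(1, "initialize", {
    "protocolVersion": "2025-06-18",
    "capabilities": {},
    "clientInfo": {"name": "example-client", "version": "0.0.0"},  # hypothetical values
})

# check: "tools/list" -> tool discovery, which takes no required params.
tools_list = jsonrpc_request(2, "tools/list")

print(json.dumps(initialize, indent=2))
print(json.dumps(tools_list, indent=2))
```

The responses to requests like these are the material the corpus grades, which is why two checks are enough to feed rules well beyond the two methods they name.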

Run it (clean)

Grade the conformant endpoint:

mcptest compliance run --from-suite tests/spec-grade.yml \
  --server-label "test.mcptest.sh (conformant)"

--from-suite points the runner at the suite; the runner then drives the live server and grades the corpus against the responses it collected. --server-label is the name printed in the report header. The default output is color-tinted text. Add --format markdown for a CI summary, --format json for a machine consumer, or --format html for a self-contained report.

Run it against the non-conformant endpoint

Point the same run at the broken server. The suite already declares it, so you could change each check's server: to broken, or keep a second suite that targets it. The simplest path is the global --server-url override, which repoints the whole suite at one URL:

mcptest compliance run --from-suite tests/spec-grade.yml \
  --server-url "https://test.mcptest.sh/mcp?scenario=invalid" \
  --server-label "test.mcptest.sh (scenario=invalid)"

This run exits 1 because spec-violating MUST rules fail without a baseline to suppress them.
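If you want both grades from one script, the two invocations can be wrapped as below. This is a sketch that assumes mcptest is on PATH and uses only the flags shown in this scenario; the helper names are hypothetical. It passes the conformant URL through --server-url as well, relying on the override repointing the whole suite.

```python
import subprocess

SUITE = "tests/spec-grade.yml"

# (server URL, report label) pairs taken from the two runs in this scenario.
RUNS = [
    ("https://test.mcptest.sh/mcp", "test.mcptest.sh (conformant)"),
    ("https://test.mcptest.sh/mcp?scenario=invalid", "test.mcptest.sh (scenario=invalid)"),
]


def build_command(url: str, label: str) -> list[str]:
    """Assemble the argv for one compliance run against a single endpoint."""
    return [
        "mcptest", "compliance", "run",
        "--from-suite", SUITE,
        "--server-url", url,
        "--server-label", label,
    ]


def grade_all() -> dict[str, int]:
    """Run every configured command and map each label to its exit code."""
    return {
        label: subprocess.run(build_command(url, label)).returncode
        for url, label in RUNS
    }
```

Calling grade_all() runs the endpoints one after the other and returns their exit codes, so a wrapper can decide for itself which non-zero code should fail the job.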

Expected output

The clean run scores high. Every applicable MUST passes. The report shows two optional rules unmet, a transport SHOULD (HEADER-002, the X-Server-Version response header) and a resources MAY (RES-006), and the grade lands at A rather than A+:

mcptest compliance against test.mcptest.sh (conformant)
  Rubric version: v2025-06-18

  Lifecycle      [A+]   MUST 4/4, SHOULD 2/2, MAY 1/1
  Tools          [A+]   MUST 5/5, SHOULD 3/3, MAY 1/1
  Resources      [A]    MUST 3/3, SHOULD 2/2, MAY 0/1
  Prompts        [A+]   MUST 2/2, SHOULD 1/1, MAY 1/1
  Transport      [A]    MUST 4/4, SHOULD 3/4, MAY 1/1

Grade: A
Score: 96.4/100
Completeness: 92%

Currently A. To reach A+, implement these MAY rules: RES-006.

The broken run fails two lifecycle MUSTs and one tools MUST and drops to D. A single failed basic-conformance MUST is enough to cap the grade well below A, and the per-section table shows exactly where:

mcptest compliance against test.mcptest.sh (scenario=invalid)
  Rubric version: v2025-06-18

  Lifecycle      [D]    MUST 2/4, SHOULD 2/2, MAY 0/1
  Tools          [C]    MUST 4/5, SHOULD 3/3, MAY 0/1
  Resources      [A]    MUST 3/3, SHOULD 2/2, MAY 1/1
  Prompts        [A]    MUST 2/2, SHOULD 1/1, MAY 0/1
  Transport      [B]    MUST 4/4, SHOULD 2/4, MAY 0/1

Grade: D
Score: 61.2/100
Completeness: 64%

Failed MUST checks:
  PROTO-002
  PROTO-005
  TOOL-005

Currently D. Pass at least 50% of SHOULD rules to reach C.

Reading the failures:

PROTO-002 (initialize returns negotiated capabilities) fails because the ?scenario=invalid endpoint omits the capabilities block.
PROTO-005 (JSON-RPC version is always 2.0) and the lifecycle handshake rules trip on the malformed protocolVersion the endpoint returns.
TOOL-005 (tools/call result content is a non-empty array of content blocks) fails because the broken endpoint returns a result with empty content.

Each ID maps to one clause of the corpus, so you can look up the exact rule, read its spec citation, and fix the response shape it flags.
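When you only have the text report, the failed IDs can be pulled out with a few lines of Python. This sketch assumes the layout shown in the expected output above (a "Failed MUST checks:" heading followed by one indented ID per line); the function name is hypothetical, and the JSON format is likely the sturdier source for automation, though its schema is not shown in this scenario.

```python
import re

# Rule IDs in this scenario look like PROTO-002 or TOOL-005.
RULE_ID = re.compile(r"^[A-Z]+-\d{3}$")


def failed_must_ids(report: str) -> list[str]:
    """Collect the rule IDs listed under the 'Failed MUST checks:' heading."""
    ids: list[str] = []
    in_block = False
    for line in report.splitlines():
        stripped = line.strip()
        if stripped == "Failed MUST checks:":
            in_block = True
            continue
        if in_block:
            if RULE_ID.match(stripped):
                ids.append(stripped)
            else:
                break  # a blank line or any other text ends the block
    return ids


sample = """Grade: D
Score: 61.2/100
Completeness: 64%

Failed MUST checks:
  PROTO-002
  PROTO-005
  TOOL-005

Currently D. Pass at least 50% of SHOULD rules to reach C.
"""

print(failed_must_ids(sample))
```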

Baseline the known failures

If your own server fails a rule you cannot fix this sprint, you do not have to leave CI red. Declare the known failures in a baseline file and gate only on new regressions:

mcptest compliance run --from-suite tests/spec-grade.yml \
  --baseline tests/compliance-baseline.yml

A failing rule listed in the baseline counts as an expected failure and exits 0. A rule that fails without being listed is a new regression and exits 1. A baselined rule that starts passing is a stale entry and also exits 1, so the file gets trimmed as you fix things. The full pattern, including the short and long entry forms, is in docs/compliance-baseline.md.
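The three outcomes reduce to set arithmetic. The following is a sketch of the gating logic as described in this section, not mcptest's implementation; the function names are hypothetical, and it assumes every baselined rule was actually evaluated in the run.

```python
def classify(failing: set[str], baseline: set[str]) -> dict[str, set[str]]:
    """Split rule IDs into expected failures, new regressions, and stale entries."""
    return {
        "expected": failing & baseline,     # failing and listed: tolerated
        "regressions": failing - baseline,  # failing but not listed: new breakage
        "stale": baseline - failing,        # listed but now passing: trim the file
    }


def exit_code(failing: set[str], baseline: set[str]) -> int:
    """Return 0 only when every failure is baselined and no baseline entry is stale."""
    outcome = classify(failing, baseline)
    return 1 if outcome["regressions"] or outcome["stale"] else 0
```

For example, with PROTO-002 baselined, a run that fails only PROTO-002 gives exit code 0, a run that also fails TOOL-005 gives 1, and a run where PROTO-002 starts passing gives 1 until the entry is removed.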

Troubleshooting

Both runs grade the same endpoint. --server-url repoints the entire suite, so it wins over the suite's servers: entries. To grade both the conformant and broken endpoints, either drop the override and rely on the two named servers in the suite, or run the command twice with distinct --server-url values and --server-label headers.
A whole section reads N/A (capability not declared). Section applicability follows the capabilities the server advertised during initialize. If the broken endpoint omits capabilities, the tools, resources, and prompts sections may not be evaluated at all. Pass --capabilities tools,resources,prompts to force those sections in when you know the server should support them.
The grade is lower than the failing-rule count suggests. A single failed basic protocol-conformance MUST caps the grade hard. One lifecycle MUST failure can pull an otherwise clean server down to D or F regardless of how many other checks pass. Read the per-section table, not just the headline letter.
You want a machine-readable result. Add --format json to emit the same score envelope as a JSON document, or --format html for a self-contained report you can attach to a CI run.
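For the N/A case above, the applicability rule can be pictured as follows. This is a hypothetical model of the behavior described, not the runner's code: it assumes only the tools, resources, and prompts sections are gated on advertised capabilities, and that the forced list has the same comma-separated form as the --capabilities value.

```python
from typing import Any

# Sections assumed to depend on what the server advertised during initialize.
CAPABILITY_SECTIONS = ("tools", "resources", "prompts")


def applicable_sections(advertised: dict[str, Any], forced: str = "") -> list[str]:
    """List the capability-gated sections that would be graded.

    advertised: the capabilities object from the initialize result.
    forced: comma-separated names, e.g. "tools,resources,prompts".
    """
    forced_names = {name.strip() for name in forced.split(",") if name.strip()}
    return [
        section for section in CAPABILITY_SECTIONS
        if section in advertised or section in forced_names
    ]


print(applicable_sections({}))                             # nothing advertised
print(applicable_sections({}, "tools,resources,prompts"))  # all three forced in
```

Under this model an endpoint that omits capabilities yields an empty list, which matches the N/A symptom, and forcing the names restores all three sections.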