Scenario 10: grade a server against the spec
A passing test suite tells you your server does what you asked it to do. It does not tell you whether your server follows the MCP specification. Those are different questions. A tool can return exactly the text you expect while the initialize response omits a capabilities block, or while a tools/call result returns no content blocks at all. Clients in the wild break on exactly these shapes.
mcptest compliance run answers the spec question. It drives a fixed corpus of 69 rules across five sections (lifecycle, tools, resources, prompts, transport-auth) and reports a single letter grade from A+ to F, broken out per section. Rule failures are reported by stable IDs like PROTO-002, TOOL-005, SCHEMA-005, and HEADER-002, so you can look up exactly which clause your server tripped.
This scenario grades two endpoints on the hosted test server: the conformant one, which scores well, and a deliberately broken one, which fails three MUST rules and drops to a low grade. No local server and no credentials are needed.
The YAML
Save this as tests/spec-grade.yml:
# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json
servers:
  conformant:
    url: "https://test.mcptest.sh/mcp"
  broken:
    url: "https://test.mcptest.sh/mcp?scenario=invalid"
compliance:
  - name: "initialize handshake"
    server: conformant
    check: "initialize"
  - name: "tools list shape"
    server: conformant
    check: "tools/list"
What is happening here:
- The conformant server points at https://test.mcptest.sh/mcp, the hosted test server's well-behaved endpoint. It returns a valid protocolVersion, a populated capabilities object, and tool results with real content blocks.
- The broken server points at the same host with ?scenario=invalid. That endpoint returns spec-violating responses on purpose: a malformed protocolVersion, no capabilities block in the initialize response, and a tools/call result with an empty content field.
- The compliance: block lists named checks. check: "initialize" drives the lifecycle handshake; check: "tools/list" drives tool discovery. These named checks feed the corpus the live data it grades against.
compliance run scores the full 69-rule corpus against the session, not just the two named checks. The named checks are how the runner exercises the server so the corpus has responses to grade.
Run it (clean)
Grade the conformant endpoint:
mcptest compliance run --from-suite tests/spec-grade.yml \
--server-label "test.mcptest.sh (conformant)"
--from-suite points the runner at the suite, drives the live server, and grades the corpus against the responses it collected. --server-label is the name printed in the report header. The default output is color-tinted text with the grade on the first lines; add --format markdown, --format json, or --format html for a CI summary, a machine consumer, or a self-contained report.
Run it against the non-conformant endpoint
Now grade the broken server. The suite already declares it, but both named checks target conformant. You could change each check's server: to broken, or keep a second suite. The simplest path is the global --server-url override, which repoints the whole suite at one URL:
mcptest compliance run --from-suite tests/spec-grade.yml \
--server-url "https://test.mcptest.sh/mcp?scenario=invalid" \
--server-label "test.mcptest.sh (scenario=invalid)"
This run exits 1 because spec-violating MUST rules fail without a baseline to suppress them.
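The two invocations above could be driven from one script. The following is a minimal Python sketch, not part of mcptest: the helper names build_compliance_argv and grade are hypothetical, and only flags already shown on this page are used. The conformant run relies on the suite's own server entry, so it passes no --server-url.

```python
import subprocess
from typing import List, Optional


def build_compliance_argv(
    suite: str,
    server_label: str,
    server_url: Optional[str] = None,
) -> List[str]:
    """Build the argv for one `mcptest compliance run` invocation.

    Hypothetical helper: it only assembles flags that appear on this page.
    """
    argv = ["mcptest", "compliance", "run", "--from-suite", suite]
    if server_url is not None:
        # --server-url repoints the whole suite at a single URL.
        argv += ["--server-url", server_url]
    argv += ["--server-label", server_label]
    return argv


def grade(argv: List[str]) -> int:
    """Run one grading pass and return its exit code.

    Requires the mcptest binary on PATH; not called at import time.
    """
    return subprocess.run(argv).returncode


SUITE = "tests/spec-grade.yml"

RUNS = [
    # Conformant endpoint: use the server declared in the suite.
    build_compliance_argv(SUITE, "test.mcptest.sh (conformant)"),
    # Broken endpoint: override the URL for the whole suite.
    build_compliance_argv(
        SUITE,
        "test.mcptest.sh (scenario=invalid)",
        server_url="https://test.mcptest.sh/mcp?scenario=invalid",
    ),
]
```

Calling grade(argv) for each entry in RUNS would execute both passes. Per the text above, the second one is expected to return 1.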
Expected output
The clean run scores high. Every applicable MUST passes. One optional transport SHOULD (HEADER-002, the X-Server-Version response header) is missing and one resources MAY (RES-006) is not implemented, so the run lands at A rather than A+:
mcptest compliance against test.mcptest.sh (conformant)
Rubric version: v2025-06-18
Lifecycle [A+] MUST 4/4, SHOULD 2/2, MAY 1/1
Tools [A+] MUST 5/5, SHOULD 3/3, MAY 1/1
Resources [A] MUST 3/3, SHOULD 2/2, MAY 0/1
Prompts [A+] MUST 2/2, SHOULD 1/1, MAY 1/1
Transport [A] MUST 4/4, SHOULD 3/4, MAY 1/1
Grade: A
Score: 96.4/100
Completeness: 92%
Currently A. To reach A+, implement these MAY rules: RES-006.
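If you need the per-section numbers in a script, the text report can be parsed directly. The sketch below assumes the line layout is exactly as in the sample above; parse_report and sections_missing_must are hypothetical names, and --format json is likely the sturdier route once you know its schema, which this page does not show.

```python
import re
from typing import Dict, List

# Matches lines such as: "Transport [A] MUST 4/4, SHOULD 3/4, MAY 1/1"
SECTION_RE = re.compile(
    r"^(?P<name>\w[\w-]*)\s+\[(?P<grade>[A-F][+-]?)\]\s+"
    r"MUST (?P<must_pass>\d+)/(?P<must_total>\d+), "
    r"SHOULD (?P<should_pass>\d+)/(?P<should_total>\d+), "
    r"MAY (?P<may_pass>\d+)/(?P<may_total>\d+)$"
)


def parse_report(text: str) -> Dict[str, object]:
    """Extract the overall grade and per-section counts from a text report."""
    sections: Dict[str, Dict[str, object]] = {}
    overall = None
    for raw in text.splitlines():
        line = raw.strip()
        m = SECTION_RE.match(line)
        if m:
            counts = {
                key: int(value)
                for key, value in m.groupdict().items()
                if key not in ("name", "grade")
            }
            sections[m["name"]] = {"grade": m["grade"], **counts}
        elif line.startswith("Grade:"):
            overall = line.split(":", 1)[1].strip()
    return {"grade": overall, "sections": sections}


def sections_missing_must(report: Dict[str, object]) -> List[str]:
    """Names of sections where at least one MUST rule did not pass."""
    return [
        name
        for name, s in report["sections"].items()
        if s["must_pass"] < s["must_total"]
    ]


SAMPLE = """\
mcptest compliance against test.mcptest.sh (conformant)
Rubric version: v2025-06-18
Lifecycle [A+] MUST 4/4, SHOULD 2/2, MAY 1/1
Tools [A+] MUST 5/5, SHOULD 3/3, MAY 1/1
Resources [A] MUST 3/3, SHOULD 2/2, MAY 0/1
Prompts [A+] MUST 2/2, SHOULD 1/1, MAY 1/1
Transport [A] MUST 4/4, SHOULD 3/4, MAY 1/1
Grade: A
Score: 96.4/100
"""

report = parse_report(SAMPLE)
```

For the clean sample, sections_missing_must(report) is empty. A section line with, say, MUST 2/4 would be listed by name.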
The broken run fails the lifecycle and tools MUSTs and drops to D. A single failed basic-conformance MUST is enough to cap the grade well below A, and the per-section table shows exactly where:
mcptest compliance against test.mcptest.sh (scenario=invalid)
Rubric version: v2025-06-18
Lifecycle [D] MUST 2/4, SHOULD 2/2, MAY 0/1
Tools [C] MUST 4/5, SHOULD 3/3, MAY 0/1
Resources [A] MUST 3/3, SHOULD 2/2, MAY 1/1
Prompts [A] MUST 2/2, SHOULD 1/1, MAY 0/1
Transport [B] MUST 4/4, SHOULD 2/4, MAY 0/1
Grade: D
Score: 61.2/100
Completeness: 64%
Failed MUST checks:
PROTO-002
PROTO-005
TOOL-005
Currently D. Pass at least 50% of SHOULD rules to reach C.
Reading the failures:
- PROTO-002 (initialize returns negotiated capabilities) fails because the ?scenario=invalid endpoint omits the capabilities block.
- PROTO-005 (JSON-RPC version is always 2.0) and the lifecycle handshake rules trip on the malformed protocolVersion the endpoint returns.
- TOOL-005 (tools/call result content is a non-empty array of content blocks) fails because the broken endpoint returns a result with empty content.
Each ID maps to one clause of the corpus, so you can look up the exact rule, read its spec citation, and fix the response shape it flags.
Baseline the known failures
If your own server fails a rule you cannot fix this sprint, you do not have to leave CI red. Declare the known failures in a baseline file and gate only on new regressions:
mcptest compliance run --from-suite tests/spec-grade.yml \
--baseline tests/compliance-baseline.yml
A failing rule listed in the baseline counts as an expected failure and exits 0. A rule that fails without being listed is a new regression and exits 1. A baselined rule that starts passing is a stale entry and also exits 1, so the file gets trimmed as you fix things. The full pattern, including the short and long entry forms, is in docs/compliance-baseline.md.
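The three outcomes above reduce to set arithmetic over rule IDs. This Python sketch models the described behavior only. It is not mcptest's implementation, classify and exit_code are hypothetical names, and it simplifies by treating every baselined rule that is not failing as passing (a rule skipped as not applicable would need separate handling).

```python
from typing import Dict, Iterable, List


def classify(failed: Iterable[str], baseline: Iterable[str]) -> Dict[str, List[str]]:
    """Split rule IDs into expected failures, new regressions, and stale entries."""
    failed_set, baseline_set = set(failed), set(baseline)
    return {
        # Failing and listed in the baseline: tolerated.
        "expected": sorted(failed_set & baseline_set),
        # Failing but not listed: a new regression.
        "regressions": sorted(failed_set - baseline_set),
        # Listed but no longer failing: the baseline entry is stale.
        "stale": sorted(baseline_set - failed_set),
    }


def exit_code(result: Dict[str, List[str]]) -> int:
    """0 only when every failure is expected and no baseline entry is stale."""
    return 1 if result["regressions"] or result["stale"] else 0


# Rule IDs taken from the broken run shown earlier on this page.
FAILED = ["PROTO-002", "PROTO-005", "TOOL-005"]

all_known = classify(FAILED, baseline=FAILED)
one_new = classify(FAILED, baseline=["PROTO-002", "PROTO-005"])
one_fixed = classify(["PROTO-002"], baseline=["PROTO-002", "PROTO-005"])
```

Here all_known maps to exit code 0, while one_new (TOOL-005 is unlisted) and one_fixed (PROTO-005 is stale) both map to 1.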
Troubleshooting
- Both runs grade the same endpoint. --server-url repoints the entire suite, so it wins over the suite's servers: entries. To grade the conformant and broken endpoints in one invocation, drop the override and keep two named servers, or run the command twice with distinct --server-url values and --server-label headers.
- A whole section reads N/A (capability not declared). Section applicability follows the capabilities the server advertised during initialize. If the broken endpoint omits capabilities, the tools, resources, and prompts sections may not be evaluated at all. Pass --capabilities tools,resources,prompts to force those sections in when you know the server should support them.
- The grade is lower than the failing-rule count suggests. A single failed basic protocol-conformance MUST caps the grade hard. One lifecycle MUST failure can pull an otherwise clean server down to D or F regardless of how many other checks pass. Read the per-section table, not just the headline letter.
- You want a machine-readable result. Add --format json to emit the same score envelope as a JSON document, or --format html for a self-contained report you can attach to a CI run.
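The N/A (capability not declared) case above can be pictured as a small gating function. This Python sketch is illustrative only: applicable_sections is a hypothetical name, and treating lifecycle and transport-auth as always evaluated is an assumption, not something this page states.

```python
from typing import Iterable, List

# Sections the text above describes as following declared capabilities.
CAPABILITY_SECTIONS = ("tools", "resources", "prompts")
# Assumption: these two sections are graded regardless of capabilities.
ALWAYS_ON = ("lifecycle", "transport-auth")


def applicable_sections(
    initialize_result: dict,
    forced: Iterable[str] = (),
) -> List[str]:
    """Return the sections that would be graded for one initialize result.

    `forced` models the effect of --capabilities tools,resources,prompts.
    """
    # A missing or null capabilities block is treated as "nothing declared".
    declared = initialize_result.get("capabilities") or {}
    forced_set = set(forced)
    sections = list(ALWAYS_ON)
    for name in CAPABILITY_SECTIONS:
        if name in declared or name in forced_set:
            sections.append(name)
    return sections


# A server that declares only tools, one that omits capabilities entirely,
# and the same omission with all three sections forced in.
tools_only = applicable_sections({"capabilities": {"tools": {}}})
omitted = applicable_sections({})
forced_in = applicable_sections({}, forced=["tools", "resources", "prompts"])
```

Under these assumptions, omitting capabilities leaves only the always-on sections, and forcing restores all five.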
See also
- docs/compliance-grade.md, the grade bands and how the letter is derived from the corpus.
- docs/compliance-baseline.md, the expected-failures pattern for gating CI on new regressions only.
- Previous: Scan for attacks.
- Next: Test behind OAuth.