Research grounding

mcptest is a research-anchored project, not an ad hoc one. Three audiences read this page. Researchers should be able to trace any design decision back to the literature that motivated it. Evaluators should be able to confirm that the methodology is defensible and grounded in peer-reviewed and preprint work, not vibes. Contributors should be able to see, before touching a matcher or a reporter, the prior art that shaped how that piece behaves. Every citation below links to a specific mcptest design decision via the "Informs" line at the end of the entry.

Each entry carries one disposition: adopt (shipped or being built), partial (some of the idea already ships), or watch (tracked but demand-gated, so nothing is committed yet). This keeps the page honest about what is shipped versus what is on the roadmap.

LLM-as-judge and LLM-as-jury

These works inform the W7 milestone, where mcptest will support an optional "judge" matcher that scores tool output with a panel of small models rather than a single large one.
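
As a rough illustration of the jury idea, the sketch below aggregates pass/fail votes from several judges by strict majority. It is not the W7 matcher: `jury_verdict`, the `Judge` type, and the three stub judges are hypothetical names invented for this example, and real judges would call small models rather than inspect strings.

```python
from typing import Callable, Sequence

# Hypothetical judge type: takes tool output, returns True for "pass".
Judge = Callable[[str], bool]


def jury_verdict(output: str, judges: Sequence[Judge]) -> bool:
    """Return True when a strict majority of judges pass the output."""
    if not judges:
        raise ValueError("a jury needs at least one judge")
    passes = sum(1 for judge in judges if judge(output))
    # Strict majority: a tie counts as a failure.
    return passes * 2 > len(judges)


# Stub judges standing in for small models (assumption for illustration).
def mentions_status(output: str) -> bool:
    return "status" in output


def is_short(output: str) -> bool:
    return len(output) < 200


def has_no_error(output: str) -> bool:
    return "error" not in output.lower()


panel = [mentions_status, is_short, has_no_error]
print(jury_verdict('{"status": "ok"}', panel))
print(jury_verdict("Error: timeout", panel))
```

Treating ties as failures is one possible design choice; it keeps an even-sized panel from passing output that half the judges rejected.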

Judge bias literature

These works inform the bias-mitigation knobs the judge matcher will expose, including pairwise randomization, length normalization, and self-preference controls.
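
To make pairwise randomization concrete, here is a minimal sketch that shows two candidates to a judge in random order and maps the verdict back to the original labels. The names `randomized_compare`, `PairJudge`, and `always_first` are hypothetical and do not describe the knobs mcptest will actually expose; the stub judge is deliberately position-biased so that the effect of randomization is visible.

```python
import random
from typing import Callable

# Hypothetical pairwise judge: returns 0 if it prefers the first
# candidate it is shown, 1 if it prefers the second.
PairJudge = Callable[[str, str], int]


def randomized_compare(a: str, b: str, judge: PairJudge, rng: random.Random) -> str:
    """Show the pair in random order and map the verdict back to 'a' or 'b'."""
    swapped = rng.random() < 0.5
    first, second = (b, a) if swapped else (a, b)
    picked_first = judge(first, second) == 0
    # When the order was swapped, the first slot held candidate b.
    if picked_first:
        return "b" if swapped else "a"
    return "a" if swapped else "b"


def always_first(first: str, second: str) -> int:
    """A maximally position-biased stub judge (assumption for illustration)."""
    return 0


rng = random.Random(0)
wins = {"a": 0, "b": 0}
for _ in range(1000):
    wins[randomized_compare("answer A", "answer B", always_first, rng)] += 1
print(wins)
```

With a judge that always picks the first slot, randomizing the order spreads wins roughly evenly across both candidates instead of crediting whichever one happened to be listed first.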

MCP-specific empirical research

These works ground the doctor checks and the cost-and-latency reporting in measured behavior of real MCP servers, not in folklore.

MCP agent benchmarks

These benchmarks set the bar that mcptest's example suite and doctor checks align against. Where a benchmark exposes a public test corpus, we use it as ground truth for the matcher library.

MCP security

These works build the MCP threat taxonomy, the threat-benchmark corpora, the attack analyses, and the defense signals that the doctor checks and the scorecard's defense-posture view draw on. Surveys and systematization come first, then threat benchmarks, attack analyses, and defenses.

Considered, not adopted

These were scanned in the same May 2026 sweep but fall outside mcptest's testing scope.

Security testing frameworks and red-team tooling

These inform the multi-layer security testing design: how checks and attacks are structured, and which external engines mcptest runs.

CI testing methodology

These references shape how mcptest fits into a developer's CI loop. The philosophy is three-layer regression (functional, contract, performance) with strict false-positive budgets.

LLM regression testing for model migration

These works inform the future W8 milestone, which will let teams pin a test suite to a model version and detect drift when they upgrade.
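
One way to picture drift detection is as a diff between per-test results recorded under the pinned model and under the upgraded one. The sketch below assumes results are plain pass/fail booleans keyed by test name; `detect_drift`, the result format, and the test names are hypothetical and say nothing about how W8 will store or compare runs.

```python
from typing import Dict, List


def detect_drift(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> Dict[str, List[str]]:
    """Compare per-test pass/fail results recorded under two model versions."""
    shared = baseline.keys() & candidate.keys()
    return {
        # Passed under the pinned model, fails after the upgrade.
        "regressed": sorted(t for t in shared if baseline[t] and not candidate[t]),
        # Failed under the pinned model, passes after the upgrade.
        "fixed": sorted(t for t in shared if not baseline[t] and candidate[t]),
        # Present in the baseline but absent from the new run.
        "missing": sorted(baseline.keys() - candidate.keys()),
    }


# Hypothetical results keyed by test name.
pinned = {"lists_tools": True, "handles_timeout": True, "rejects_bad_args": False}
upgraded = {"lists_tools": True, "handles_timeout": False, "rejects_bad_args": True}

report = detect_drift(pinned, upgraded)
print(report)
```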

LLM application testing philosophy

The two papers below shape the overall stance of mcptest as a testing tool for LLM-adjacent systems: deterministic gates first, judged behavior second.

Lifecycle and evaluation framing

These works frame mcptest's whole-lifecycle positioning, from the evaluation drivers that map onto the scorecard to the observability layer that records what happened.

Industry tool-use and code-mode guidance

This section collects vendor engineering guidance on how models should use tools. It shapes what mcptest checks: tool-definition quality, token efficiency, and the code-mode access pattern.

Update cadence

Refreshed quarterly. PRs welcome via the mcptest repo. Contribution rules live in AGENTS.md.