Scenario 7: LLM-judge matcher preview
The llm-judge matcher routes a tool's response through an LLM with a grading rubric and passes when the judge's score meets a threshold. It is the right tool for "the response should be correct" when correctness is too fuzzy for exact, contains, or regex. The cost: every llm-judge assertion makes an extra LLM call, which costs money and adds latency.
v1.0 ships the YAML surface and a stub runner that returns NotYetAvailable. The full runner is planned for a later release. The schema accepts the shape today so you can author suites ahead of the runner; this scenario shows the intended end-to-end flow and the --max-cost safety cap that will gate the feature when it ships.
The literature grounding (which models to use, how to mitigate position bias, calibration techniques) is in docs/research-references.md.
The YAML
Save this as tests/judge.yml:
# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json
servers:
  remote_api:
    url: "https://mcp.example.com/v1"
    auth:
      bearer_token_env: "MCPTEST_API_TOKEN"
tools:
  - name: "summary mentions the service and the release tag"
    server: remote_api
    tool: "summarize_deployment"
    args:
      service: "checkout"
      release: "v2.31.4"
    expect:
      - target: "result.content[0].text"
        matcher:
          llm-judge:
            rubric: |
              The answer must mention the service name "checkout"
              and the release tag "v2.31.4". Score 1.0 if both are
              present and the summary is plausible; 0.5 if only one
              is present; 0.0 if neither is present or the answer
              is off-topic.
            threshold: 0.7
            model: "claude-sonnet-4-5"
  - name: "error explanation is actionable"
    server: remote_api
    tool: "explain_error"
    args:
      code: "ERR_RATE_LIMIT"
    expect:
      - target: "result.content[0].text"
        matcher:
          llm-judge:
            rubric: |
              The answer must explain that the request was rate
              limited and suggest a concrete next step (retry with
              backoff, contact support, upgrade plan). Score 1.0 if
              both are present, 0.5 if only one, 0.0 otherwise.
            threshold: 0.7
Two llm-judge assertions. The first pins a judge model explicitly; the second lets the runner pick the default per the W7 design. The rubric is a free-form string handed to the judge.
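The pass rule behind these assertions (a judge score that meets the threshold) can be sketched in a few lines of Python. This is an illustration only, not mcptest's implementation: `JudgeAssertion`, `passes`, and `failure_reason` are hypothetical names, the inclusive comparison is an assumption drawn from the phrase "meets a threshold", and an unset `model` simply stands for "let the runner pick the default".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JudgeAssertion:
    """Hypothetical in-memory form of one llm-judge matcher entry."""

    rubric: str
    threshold: float
    model: Optional[str] = None  # None: leave the choice to the runner's default


def passes(assertion: JudgeAssertion, score: float) -> bool:
    """Assumed rule: pass when the judge's score is at least the threshold."""
    return score >= assertion.threshold


def failure_reason(assertion: JudgeAssertion, score: float) -> str:
    """Format the comparison the way the sample failure output shows it."""
    return f"llm-judge: score {score:.2f} < threshold {assertion.threshold:.2f}"


pinned = JudgeAssertion(rubric="...", threshold=0.7, model="claude-sonnet-4-5")
unpinned = JudgeAssertion(rubric="...", threshold=0.7)

print(passes(pinned, 0.92))          # True
print(passes(pinned, 0.55))          # False
print(failure_reason(pinned, 0.55))  # llm-judge: score 0.55 < threshold 0.70
```

With a threshold of 0.7 and a rubric that awards 0.5 for a half-complete answer, a response that satisfies only one criterion fails under this rule.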
The --max-cost safety cap
The mcptest eval subcommand (not mcptest run, which skips evals by default) gates the judge feature behind a budget:
# abort the run once accumulated judge cost would exceed $0.50
mcptest eval --max-cost 0.50 tests/judge.yml
--max-cost N is a hard ceiling in USD across every LLM-judge call in the run. The run aborts the moment accumulated cost would exceed N, so a misconfigured suite cannot rack up unbounded spend.
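The ceiling semantics described above can be sketched as a small accumulator. This is a hypothetical illustration of the idea, not the runner's code: `CostBudget`, `BudgetExceeded`, and `charge` are invented names, and the sketch assumes the cost of a call is known (or estimated) before it is recorded.

```python
from decimal import Decimal


class BudgetExceeded(Exception):
    """Raised when the next judge call would push spend past the cap."""


class CostBudget:
    """Hypothetical accumulator for a --max-cost style ceiling, in USD."""

    def __init__(self, max_cost: str) -> None:
        # Decimal avoids float rounding drift when summing many small charges.
        self.cap = Decimal(max_cost)
        self.spent = Decimal("0")

    def charge(self, estimated_cost: str) -> None:
        """Record a call's cost, or abort if the cap would be exceeded."""
        cost = Decimal(estimated_cost)
        if self.spent + cost > self.cap:
            raise BudgetExceeded(
                f"${self.spent + cost} would exceed the ${self.cap} cap"
            )
        self.spent += cost


budget = CostBudget("0.50")
budget.charge("0.0014")  # first judge call from the sample output
budget.charge("0.0011")  # second judge call
print(budget.spent)      # 0.0025
```

The check runs before the charge is added, so a call that lands exactly on the cap is allowed and only a call that would go over it aborts the run.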
Expected output (after the full runner ships)
mcptest eval tests/judge.yml --max-cost 0.50
PASS summary mentions the service and the release tag (score 0.92, $0.0014, 1.8s)
PASS error explanation is actionable (score 0.81, $0.0011, 1.5s)
2 passed, 0 failed in 3.3s ($0.0025 total, under $0.50 cap)
On failure:
FAIL summary mentions the service and the release tag (score 0.55, $0.0014, 1.8s)
llm-judge: score 0.55 < threshold 0.70
explanation: response mentioned the service "checkout" but not the release tag "v2.31.4"
The reporter shows the score, the cost, and the latency for each assertion and, on failure, the judge's free-form explanation. The summary line at the end reports the cost accumulated across the run so you can see the total at a glance.
Today's output (v1.0, runner stub)
Until the full runner lands, the matcher accepts the YAML and the schema validates the shape, but the runner returns NotYetAvailable:
mcptest eval tests/judge.yml
SKIP summary mentions the service and the release tag (llm-judge: NotYetAvailable)
SKIP error explanation is actionable (llm-judge: NotYetAvailable)
0 passed, 0 failed, 2 skipped in 4ms
SKIP is not a pass; the runner exits non-zero so CI gates do not accidentally promote a skipped judge to a green check. Once the full runner ships, the same YAML runs end-to-end without edits.
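The "SKIP is not a pass" rule amounts to treating skipped tests like failures when computing the process exit status. A minimal sketch, assuming hypothetical names (`RunSummary`, `exit_code`) and an exit value of 1, since the text above only promises "non-zero":

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical tally matching the counts on the summary line."""

    passed: int
    failed: int
    skipped: int


def exit_code(summary: RunSummary) -> int:
    """Return 0 only when nothing failed and nothing was skipped."""
    # The specific non-zero value (1) is an assumption.
    return 0 if summary.failed == 0 and summary.skipped == 0 else 1


print(exit_code(RunSummary(passed=0, failed=0, skipped=2)))  # 1
print(exit_code(RunSummary(passed=2, failed=0, skipped=0)))  # 0
```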
Practical advice
- Write rubrics like prompts. Concrete scoring criteria ("Score 1.0 if both X and Y are present, 0.5 if only one") work better than vague ones ("The answer should be good"). The judge is just another LLM; it benefits from explicit instructions.
- Pin the judge model. A judge model change is a behavior change for your suite. The cache key does not include the judge model, because LLM-judge results are not cached at all (the eligibility engine excludes any test whose matcher set includes llm-judge). Pinning the model is a docs concern, not a cache concern, but the principle is the same: explicit beats implicit.
- Keep judge use rare. Most assertions belong to exact, contains, regex, schema, or snapshot. Reach for llm-judge when none of those fit and you genuinely care about a fuzzy property the deterministic matchers cannot express.
- Set --max-cost low until you trust the estimate. A misconfigured suite that pings the judge for every assertion can rack up real money. The cap is your seatbelt.
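The advice to pin the judge model can be checked mechanically. The sketch below is a hypothetical lint, not an mcptest feature: `unpinned_judge_tests` is an invented helper, and it assumes the suite has already been parsed into nested dicts and lists with the nesting implied by the sample YAML (`tools`, then `expect`, then `matcher`, then `llm-judge`).

```python
from typing import Any


def unpinned_judge_tests(suite: dict[str, Any]) -> list[str]:
    """Return names of tests with an llm-judge matcher that omits `model`."""
    names = []
    for test in suite.get("tools", []):
        for expectation in test.get("expect", []):
            judge = expectation.get("matcher", {}).get("llm-judge")
            if judge is not None and "model" not in judge:
                names.append(test["name"])
                break  # one report per test is enough
    return names


# Hand-written stand-in for the parsed tests/judge.yml (rubrics shortened).
suite = {
    "tools": [
        {
            "name": "summary mentions the service and the release tag",
            "expect": [
                {
                    "target": "result.content[0].text",
                    "matcher": {
                        "llm-judge": {
                            "rubric": "...",
                            "threshold": 0.7,
                            "model": "claude-sonnet-4-5",
                        }
                    },
                }
            ],
        },
        {
            "name": "error explanation is actionable",
            "expect": [
                {
                    "target": "result.content[0].text",
                    "matcher": {"llm-judge": {"rubric": "...", "threshold": 0.7}},
                }
            ],
        },
    ]
}

print(unpinned_judge_tests(suite))  # ['error explanation is actionable']
```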
See also
- docs/yaml-reference.md#llm-judge-matcher, the matcher reference.
- docs/research-references.md, the literature grounding (model choice, bias mitigation, calibration).
- Previous: URL target staging.
- Next: Catch schema drift.
- Back to the scenario index.