Scenario 7: LLM-judge matcher preview

The llm-judge matcher routes a tool's response through an LLM with a grading rubric and passes when the judge's score meets a threshold. It is the right tool for "the response should be correct" when correctness is too fuzzy for exact, contains, or regex. The cost: every assertion makes an extra LLM call, which costs money and adds latency.

v1.0 ships the YAML surface and a stub runner that returns NotYetAvailable. The full runner is planned for a later release. The schema accepts the shape today so you can author suites ahead of the runner; this scenario shows the intended end-to-end flow and the --max-cost safety cap that will gate the feature when it ships.

The literature grounding (which models to use, how to mitigate position bias, calibration techniques) is in docs/research-references.md.

The YAML

Save this as tests/judge.yml:

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json

servers:
  remote_api:
    url: "https://mcp.example.com/v1"
    auth:
      bearer_token_env: "MCPTEST_API_TOKEN"

tools:
  - name: "summary mentions the service and the release tag"
    server: remote_api
    tool: "summarize_deployment"
    args:
      service: "checkout"
      release: "v2.31.4"
    expect:
      - target: "result.content[0].text"
        matcher:
          llm-judge:
            rubric: |
              The answer must mention the service name "checkout"
              and the release tag "v2.31.4". Score 1.0 if both are
              present and the summary is plausible; 0.5 if only one
              is present; 0.0 if neither is present or the answer
              is off-topic.
            threshold: 0.7
            model: "claude-sonnet-4-5"

  - name: "error explanation is actionable"
    server: remote_api
    tool: "explain_error"
    args:
      code: "ERR_RATE_LIMIT"
    expect:
      - target: "result.content[0].text"
        matcher:
          llm-judge:
            rubric: |
              The answer must explain that the request was rate
              limited and suggest a concrete next step (retry with
              backoff, contact support, upgrade plan). Score 1.0 if
              both are present, 0.5 if only one, 0.0 otherwise.
            threshold: 0.7

This suite contains two llm-judge assertions. The first pins a judge model explicitly; the second omits model and lets the runner pick the default per the W7 design. The rubric is a free-form string passed to the judge.
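The pass rule behind threshold can be sketched in a few lines of Python. This is an illustration only, not the mcptest runner: JudgeVerdict and judge_passes are hypothetical names, and the sketch assumes "meets a threshold" means greater than or equal to.

```python
from dataclasses import dataclass


@dataclass
class JudgeVerdict:
    """Hypothetical container for what a judge returns."""
    score: float        # expected to lie in [0.0, 1.0]
    explanation: str    # the judge's free-form reasoning


def judge_passes(verdict: JudgeVerdict, threshold: float) -> bool:
    """Pass when the judge's score meets the threshold (assumed to mean >=)."""
    return verdict.score >= threshold


# With threshold 0.7, the three score levels named in the rubrics resolve as:
for score in (1.0, 0.5, 0.0):
    verdict = JudgeVerdict(score=score, explanation="")
    print(score, judge_passes(verdict, threshold=0.7))
# 1.0 True
# 0.5 False
# 0.0 False
```

Under this reading, a threshold of 0.7 means a response that satisfies only one of the two rubric conditions (score 0.5) fails the assertion.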

The `--max-cost` safety cap

The mcptest eval subcommand (not mcptest run, which skips evals by default) gates the judge feature behind a budget:

# abort the run once accumulated judge cost would exceed $0.50
mcptest eval --max-cost 0.50 tests/judge.yml

--max-cost N is a hard ceiling in USD across every LLM-judge call in the run. The run aborts the moment accumulated cost would exceed N, so a misconfigured suite cannot rack up unbounded spend.
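The "abort before exceeding" behavior can be sketched as a small accumulator. This is a minimal illustration under stated assumptions, not the mcptest implementation: CostBudget and BudgetExceeded are hypothetical names, and the sketch assumes the cost of each judge call is known (or estimated) at the point it is charged.

```python
class BudgetExceeded(Exception):
    """Raised when a judge call would push total spend past the cap."""


class CostBudget:
    """Tracks accumulated judge spend against a hard ceiling in USD."""

    def __init__(self, max_cost_usd: float) -> None:
        self.max_cost_usd = max_cost_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        # Check before recording, so accumulated spend never exceeds the cap.
        if self.spent_usd + cost_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"${self.spent_usd + cost_usd:.4f} would exceed "
                f"the ${self.max_cost_usd:.2f} cap"
            )
        self.spent_usd += cost_usd


budget = CostBudget(max_cost_usd=0.50)
for cost in (0.0014, 0.0011):  # illustrative per-call costs
    budget.charge(cost)
print(f"${budget.spent_usd:.4f} total")  # prints "$0.0025 total"
```

Checking before adding, rather than after, is what makes the cap a ceiling instead of a tripwire that fires one call too late.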

Expected output (after the full runner ships)

mcptest eval tests/judge.yml --max-cost 0.50

  PASS  summary mentions the service and the release tag    (score 0.92, $0.0014, 1.8s)
  PASS  error explanation is actionable                     (score 0.81, $0.0011, 1.5s)

2 passed, 0 failed in 3.3s ($0.0025 total, under $0.50 cap)

On failure:

  FAIL  summary mentions the service and the release tag    (score 0.55, $0.0014, 1.8s)
    llm-judge: score 0.55 < threshold 0.70
    explanation: response mentioned the service "checkout" but not the release tag "v2.31.4"

The reporter shows each test's score, cost, and duration, and on failure adds the judge's free-form explanation. The summary line reports the total cost accumulated across the run, so you can see it at a glance.

Today's output (v1.0, runner stub)

Until the full runner lands, the matcher accepts the YAML and the schema validates the shape, but the runner returns NotYetAvailable:

mcptest eval tests/judge.yml

  SKIP  summary mentions the service and the release tag    (llm-judge: NotYetAvailable)
  SKIP  error explanation is actionable                     (llm-judge: NotYetAvailable)

0 passed, 0 failed, 2 skipped in 4ms

SKIP is not a pass; the runner exits non-zero so CI gates do not accidentally promote a skipped judge to a green check. Once the full runner ships, the same YAML runs end-to-end without edits.
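The exit-status rule can be sketched as follows. This is an illustration only: exit_code is a hypothetical helper, and the specific value 1 is an assumption, since the text above only promises a non-zero exit.

```python
def exit_code(failed: int, skipped: int) -> int:
    """Return 0 only for a fully green run; skips count against it."""
    # The value 1 is an assumption; the only guarantee is "non-zero".
    return 0 if failed == 0 and skipped == 0 else 1


print(exit_code(failed=0, skipped=2))  # the v1.0 stub run above: prints 1
print(exit_code(failed=0, skipped=0))  # nothing failed or skipped: prints 0
```

Treating a skip like a failure for exit purposes means a CI job that only checks the exit status cannot mistake an unevaluated judge for a passing one.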

Practical advice

Write rubrics like prompts. Concrete scoring criteria ("Score 1.0 if both X and Y are present, 0.5 if only one") work better than vague ones ("The answer should be good"). The judge is just another LLM; it benefits from explicit instructions.
Pin the judge model. A change of judge model is a behavior change for your suite. LLM-judge results are not cached at all (the eligibility engine excludes any test whose matcher set includes llm-judge), so the cache key has no need to include the judge model. Pinning the model is therefore a documentation concern rather than a cache concern, but the principle is the same: explicit beats implicit.
Keep judge use rare. Most assertions belong to exact, contains, regex, schema, or snapshot. Reach for llm-judge when none of those fit and you genuinely care about a fuzzy property the deterministic matchers cannot express.
Set --max-cost low until you trust your estimate of what a run costs. A misconfigured suite that calls the judge for every assertion can rack up real money. The cap is your seatbelt.
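The cache exclusion mentioned under "Pin the judge model" can be sketched as a one-line predicate. This is a hedged illustration, not the eligibility engine itself: is_cacheable is a hypothetical name, and the sketch models a test's matcher set as a plain list of matcher names.

```python
def is_cacheable(matchers: list[str]) -> bool:
    """A test is cache-eligible only if none of its matchers is llm-judge."""
    return "llm-judge" not in matchers


print(is_cacheable(["exact", "regex"]))         # prints True
print(is_cacheable(["contains", "llm-judge"]))  # prints False
```

Note that a single llm-judge matcher makes the whole test ineligible, even when its other matchers are deterministic.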