
LLM evaluation: judges, juries, and when not to use them

The llm-judge matcher lets you assert qualities of a response that resist deterministic comparison: "the answer is on topic," "no toxic content," "the explanation is correct in spirit." A single judge model is cheap and fast. A jury of judges (multiple models, or multiple runs of the same model with different seeds) is the standard fix for the noise a single judge produces.

This guide covers when to reach for an LLM evaluation, when to stay with a deterministic matcher, and how to spend money on judges without surprise. The literature grounding is in docs/research-references.md. A worked single-judge example ships at examples/llm-judge-basic/; a worked jury example ships at examples/llm-jury-consensus/.

When to use an LLM evaluation

Reach for llm-judge when the correctness criterion is fuzzy and the acceptable answer set is wide. Examples:

The shared property is that no closed-form expression captures the test. A regex, a JSON Schema, or an exact matcher is wrong on either the permissive or the strict side.

When NOT to use an LLM evaluation

Stay with deterministic matchers for everything below. They are cheaper, faster, reproducible, and never wrong about a binary fact:

Deterministic checks are not "lower tier" than judge checks. They are the foundation a judge sits on top of. A test suite that gates a feature release on a single LLM judge call is one rate-limited model away from a red CI for no reason.

A practical rule: every test that uses llm-judge should pair it with at least one deterministic check that the response was even shaped correctly. The deterministic check catches the boring failure modes (empty response, wrong tool, timeout) so the judge only has to read responses that already cleared the structural bar.

Single judge vs jury

A single judge is one model call per assertion. A jury is N model calls per assertion; the jury passes when the fraction of jurors that pass meets a quorum.
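The quorum rule can be sketched in a few lines of Python. This illustrates the arithmetic only and is not mcptest's implementation; the function name jury_passes and the reading of "meets" as >= are assumptions.

```python
def jury_passes(verdicts: list[bool], quorum: float = 0.5) -> bool:
    """Return True when the fraction of passing jurors meets the quorum.

    Assumption: "meets" is read as >=; the real runner may compare differently.
    """
    if not verdicts:
        raise ValueError("a jury needs at least one verdict")
    passing = sum(verdicts)  # True counts as 1, False as 0
    return passing / len(verdicts) >= quorum


print(jury_passes([True, True, False]))       # True: 2 of 3 at quorum 0.5
print(jury_passes([True, True, False], 1.0))  # False: unanimity required
```

A single judge is the degenerate case: one verdict, and any quorum above zero reduces to that verdict.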

| Dimension | Single judge | Jury |
| --- | --- | --- |
| Cost per assertion | 1x | Nx (3 to 7 is typical) |
| Latency per assertion | 1x | 1x with parallel calls, Nx with early termination |
| Variance across runs | Higher | Lower (consensus averages out outlier verdicts) |
| Calibration against a labeled set | Harder; one model's biases dominate | Easier; compare each juror to the labels and tune its per-juror threshold or drop it |
| Setup complexity | Pick a model, write a rubric | Pick N models, write a rubric, set a quorum |
| Resistance to a flaky vendor | None; a 500 fails the test | High; a failed juror is excluded if the survivor pool can still decide |

Pick a single judge for nightly or weekly runs where one flaky verdict is acceptable. Pick a jury when:

A useful starting jury is three models from different vendors at the default simple-majority quorum: 0.5. Two vendors degrading at once is rare, and three opinions are enough to break ties without paying for five.

Cost budgeting

A hard dollar ceiling lives on the dedicated mcptest eval command, which exists to run judge and jury evaluations with a budget:

mcptest eval --max-cost 5.00 suite.yml

mcptest run itself takes --no-verdict-cache but not --max-cost: a plain run gates pass/fail and is not the place for a spend ceiling. When you want to budget judge spend, reach for mcptest eval.

A reasonable budget per CI job:

The reporter prints the cost summary at the end of every run so you can trim or grow the budget per environment.

Prompt engineering tips

The rubric: field is a prompt the judge model sees verbatim. The same prompt-engineering hygiene that applies to a production agent applies here.

A workable rubric template:

You are a strict evaluator. Read the response below and decide whether it
satisfies the criteria.

Criteria:
1. <one-line criterion>
2. <one-line criterion>
3. <one-line criterion>

Examples of passing responses:
- <one-line example>
- <one-line example>

Examples of failing responses:
- <one-line example>
- <one-line example>

Return a JSON object: {"pass": true|false, "score": 0.0-1.0, "reason": "<one sentence>"}.
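Because the template asks for a fixed JSON shape, the reply can be validated before anyone trusts it. The sketch below is a hedged illustration in Python, not mcptest's own parser; parse_verdict is a hypothetical helper and the error messages are invented for the example.

```python
import json


def parse_verdict(raw: str) -> dict:
    """Parse and validate a judge reply shaped like the template requests."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    if not isinstance(data.get("pass"), bool):
        raise ValueError("'pass' must be a boolean")
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("'score' must be a number")
    if not 0.0 <= score <= 1.0:
        raise ValueError("'score' must lie in [0.0, 1.0]")
    if not isinstance(data.get("reason"), str):
        raise ValueError("'reason' must be a string")
    return {"pass": data["pass"], "score": float(score), "reason": data["reason"]}


verdict = parse_verdict('{"pass": true, "score": 0.9, "reason": "Mentions the tag."}')
print(verdict["pass"], verdict["score"])  # True 0.9
```

Treating a malformed reply as an error rather than a silent fail keeps "the judge said no" distinct from "the judge did not answer in the requested shape."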

Choosing a quorum

There are no named consensus methods. The jury passes when the fraction of passing jurors meets quorum, so you pick the fraction that matches how contested the criterion is. The companion guide jury-consensus.md covers this in full with worked examples.

| Policy | quorum | Use when |
| --- | --- | --- |
| Simple majority | 0.5 (default) | The criterion is contested but a majority is enough. |
| Two of three | 0.67 | One dissenter is fine, two is a warning sign. |
| Unanimous | 1.0 | Hard safety or PII (personally identifiable information) properties: any reasonable juror calling it bad means bad. |
| k of n | k / n | Any explicit ratio. Four of five is 0.8. |

A pragmatic default: three jurors at the default quorum: 0.5. Raise the quorum toward 1.0 for assertions that touch safety or PII. To lean on a better-calibrated juror, give the weaker jurors a stricter per-juror threshold so they pass less readily.
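To see how per-juror thresholds interact with the quorum, here is a minimal Python sketch. It assumes a juror passes when its score is at least its threshold and that the jury compares the passing fraction to the quorum with >=; the Juror class, the juror names, and the scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Juror:
    name: str         # hypothetical label, not an mcptest field
    threshold: float  # minimum score this juror needs to count as a pass


def jury_decision(scores: dict[str, float], jurors: list[Juror],
                  quorum: float = 0.5) -> bool:
    """Apply each juror's own threshold, then the jury-wide quorum."""
    votes = [scores[j.name] >= j.threshold for j in jurors]
    return sum(votes) / len(votes) >= quorum


jurors = [
    Juror("calibrated", threshold=0.7),
    Juror("weaker-a", threshold=0.85),  # stricter bar for a weaker juror
    Juror("weaker-b", threshold=0.85),
]
scores = {"calibrated": 0.75, "weaker-a": 0.8, "weaker-b": 0.8}

print(jury_decision(scores, jurors))  # False: only 1 of 3 clears its threshold
```

With identical scores and a uniform 0.7 threshold all three jurors would pass, so the stricter thresholds are what keep the weaker jurors from carrying the vote.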

Putting it together: a balanced test

A test that uses an LLM judge well rests on a foundation of deterministic checks.

tools:
  - name: "summarizer mentions the service and the release tag"
    server: production
    tool: summarize_deploy
    args:
      service: "search-svc"
      tag: "v1.4.0"
    expect:
      assertions:
        # 1. Deterministic: the response is a string of at least 50 characters.
        - target: "result.content[0].text"
          matcher:
            schema:
              type: string
              minLength: 50

        # 2. Deterministic: it contains the literal tag.
        - target: "result.content[0].text"
          matcher:
            contains: "v1.4.0"

        # 3. Judge: it stays on topic and is well-written.
        - target: "result.content[0].text"
          matcher:
            llm-judge:
              rubric: |
                The summary must (1) mention the service name "search-svc",
                (2) reference the release tag v1.4.0, and (3) read as a
                deployment note rather than marketing copy. Return JSON
                {"pass": bool, "score": 0..1, "reason": "..."}.
              threshold: 0.7

      max_duration_ms: 5000
      max_response_tokens: 800

The deterministic checks fire first. If the response is too short or omits the tag, the test fails fast and the judge call never happens. That is the cheapest, fastest, most reproducible kind of failure: the one a regex caught.
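The fail-fast ordering can be pictured with a short Python sketch. It is a simplified model of the behavior described here, not the runner's code; evaluate, the check lambdas, and fake_judge are hypothetical stand-ins for the schema matcher, the contains matcher, and the llm-judge call.

```python
from typing import Callable


def evaluate(text: str,
             checks: list[Callable[[str], bool]],
             judge: Callable[[str], bool]) -> tuple[bool, bool]:
    """Return (passed, judge_was_called); the judge runs only if every check passes."""
    for check in checks:
        if not check(text):
            return False, False  # fail fast: no model call, no spend
    return judge(text), True


checks = [
    lambda t: len(t) >= 50,   # mirrors the minLength: 50 schema check
    lambda t: "v1.4.0" in t,  # mirrors the contains check
]


def fake_judge(text: str) -> bool:
    # Stand-in for a real model call; a real judge would read the rubric.
    return "search-svc" in text


print(evaluate("Deployed v1.4.0.", checks, fake_judge))  # (False, False)
print(evaluate(
    "Deployed search-svc at release tag v1.4.0 to production without incident.",
    checks, fake_judge))  # (True, True)
```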

Cost controls

The Cost budgeting section above introduces mcptest eval --max-cost. This section pins the settings that control judge spend and shows how they compose on a realistic test suite.

--max-cost <USD> (on mcptest eval)

Hard ceiling on cumulative LLM spend across the whole evaluation. The runner records every juror call against a cost ledger and refuses to dispatch the next call once the cap is reached. Aborted assertions return a budget-exhausted failure with the running total attached; an unset value leaves the budget unbounded.

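Accepted forms on the CLI: --max-cost 5.00, --max-cost $5.00, --max-cost 5. In a POSIX shell, quote the dollar form (--max-cost '$5.00'); unquoted, the shell expands $5 as a positional parameter before mcptest sees the argument. Negative values are rejected at parse time. This flag lives on mcptest eval (and on mcptest pipe and mcptest tools call), not on mcptest run.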
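The parsing and ledger behavior can be modeled in Python as follows. This is a sketch under stated assumptions, not the runner's source: parse_max_cost and CostLedger are hypothetical names, and the per-call cost in the demo is invented.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_max_cost(value: str) -> Decimal:
    """Accept '5.00', '$5.00', or '5'; reject negatives and non-numbers."""
    text = value.strip()
    if text.startswith("$"):
        text = text[1:]
    try:
        amount = Decimal(text)
    except InvalidOperation as exc:
        raise ValueError(f"not a dollar amount: {value!r}") from exc
    if not amount.is_finite() or amount < 0:
        raise ValueError("max cost must be a non-negative amount")
    return amount


class CostLedger:
    """Tracks cumulative spend; a cap of None means the budget is unbounded."""

    def __init__(self, cap: Optional[Decimal] = None) -> None:
        self.cap = cap
        self.total = Decimal("0")

    def can_dispatch(self) -> bool:
        # Refuse the next call once the cap has been reached.
        return self.cap is None or self.total < self.cap

    def record(self, cost: Decimal) -> None:
        self.total += cost


ledger = CostLedger(parse_max_cost("$0.001"))
dispatched = 0
for _ in range(10):
    if not ledger.can_dispatch():
        break  # a real runner would report a budget-exhausted failure here
    ledger.record(Decimal("0.00024"))  # hypothetical cost of one juror call
    dispatched += 1

print(dispatched, ledger.total)  # 5 0.00120
```

Because the check happens before dispatch, the final total can land slightly above the cap: the last call that was allowed to start is still paid for.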

--no-verdict-cache

Both mcptest run and mcptest eval cache LLM-judge verdicts keyed by the prompt, model, and inputs. The cache is on by default, so a rerun that changed nothing re-spends nothing; it is the cheapest cost control there is. Pass --no-verdict-cache to force fresh verdicts (for example when you are calibrating a rubric and want every run to re-judge).
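One way to picture a cache keyed by prompt, model, and inputs is the Python sketch below. The hashing scheme (SHA-256 over canonical JSON), the in-memory dict, and the helper names are assumptions for illustration; the actual key format and storage used by mcptest are not described here.

```python
import hashlib
import json


def verdict_cache_key(prompt: str, model: str, inputs: dict) -> str:
    """Derive a stable key: the same prompt, model, and inputs give the same key."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "inputs": inputs},
        sort_keys=True,          # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


cache: dict[str, dict] = {}
calls = 0


def fake_judge(prompt: str, model: str, inputs: dict) -> dict:
    global calls
    calls += 1  # counts how often the "model" is actually invoked
    return {"pass": True, "score": 0.9, "reason": "stub"}


def cached_verdict(prompt: str, model: str, inputs: dict) -> dict:
    key = verdict_cache_key(prompt, model, inputs)
    if key not in cache:
        cache[key] = fake_judge(prompt, model, inputs)
    return cache[key]


cached_verdict("rubric v1", "judge-model", {"text": "hello"})
cached_verdict("rubric v1", "judge-model", {"text": "hello"})  # cache hit
cached_verdict("rubric v2", "judge-model", {"text": "hello"})  # rubric changed
print(calls)  # 2
```

Editing the rubric changes the key, which is why a rubric change re-judges on its own, while --no-verdict-cache is for re-judging when nothing changed.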

Per-juror max_tokens

Every juror entry in the YAML config carries an optional max_tokens: u32 field. The runner threads this into the provider request as the output-token cap. Combined with --max-cost it puts a hard upper bound on a runaway generation: a juror that ignores the rubric and starts streaming filler still cannot overshoot the per-call budget.
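The bound that max_tokens gives on output spend is simple arithmetic, sketched here in Python. The 256-token cap is a hypothetical value and the $0.60 per million output tokens rate is the one used in the worked example in this guide; note that max_tokens caps output only, so input tokens still add to the cost of a call.

```python
def max_output_cost(max_tokens: int, output_rate_per_million: float) -> float:
    """Worst-case output spend in USD for one call capped at max_tokens."""
    return max_tokens * output_rate_per_million / 1_000_000


# Hypothetical cap of 256 output tokens at $0.60 per million output tokens.
print(f"{max_output_cost(256, 0.60):.6f}")  # 0.000154
```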

Worked example

A weekly pre-release sweep on a 200-test suite with a jury of three GPT-4o-mini judges:

mcptest eval --max-cost 5.00 examples/pre-release/

Each juror call averages about 800 input + 200 output tokens at GPT-4o-mini rates ($0.15 input / $0.60 output per million tokens). One assertion across three jurors costs (800 * 0.15 + 200 * 0.60) / 1_000_000 * 3 ~= $0.00072. At 200 assertions that is roughly $0.144. The --max-cost 5.00 cap leaves a cushion of roughly 35x for retries and longer prompts. If the suite ever grows past that, the runner aborts early and the reporter shows the budget exhaustion in the run footer:

Summary: 184 passed, 16 failed, 0 skipped in 91342 ms
Cost: $5.0012 total, $0.00083/call avg, $0.02508/test avg (6024 model calls across 199 tests)
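The budget estimate in this worked example can be reproduced with a few lines of Python, which makes it easy to rerun with your own token counts. The rates and token counts are the ones quoted in this example; the function and variable names are illustrative.

```python
INPUT_RATE = 0.15   # USD per million input tokens (rate quoted in this example)
OUTPUT_RATE = 0.60  # USD per million output tokens (rate quoted in this example)


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one juror call."""
    return (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1_000_000


jurors = 3
assertions = 200
cap = 5.00

per_assertion = call_cost(800, 200) * jurors
suite = per_assertion * assertions

print(f"per assertion: ${per_assertion:.5f}")  # per assertion: $0.00072
print(f"whole suite:   ${suite:.3f}")          # whole suite:   $0.144
print(f"cushion:       {cap / suite:.1f}x")    # cushion:       34.7x
```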

For a cheap daily PR-smoke job, lean on the verdict cache (unchanged assertions re-spend nothing) and a tight ceiling:

mcptest eval --max-cost 0.50 examples/pre-release/

See also