LLM evaluation: judges, juries, and when not to use them

The llm-judge matcher lets you assert qualities of a response that resist deterministic comparison: "the answer is on topic," "no toxic content," "the explanation is correct in spirit." A single judge model is cheap and fast. A jury of judges (multiple models, or multiple runs of the same model with different seeds) is the standard fix for the noise a single judge produces.

This guide covers when to reach for an LLM evaluation, when to stay with a deterministic matcher, and how to spend money on judges without surprises. The literature grounding is in docs/research-references.md. A worked single-judge example ships at examples/llm-judge-basic/; a worked jury example ships at examples/llm-jury-consensus/.

When to use an LLM evaluation

Reach for llm-judge when the correctness criterion is fuzzy and the acceptable answer set is wide. Examples:

"The summary mentions the service name and the release tag." A regex could miss valid paraphrases ("the v1.4.0 release of search-svc").
"The error message is friendly and explains the cause." Friendliness is not a regex.
"The code suggestion compiles and follows our house style." A judge can read the diff and rate the style half against a rubric; the "compiles" half is a binary fact that is better checked by actually compiling.
"The agent's reply does not reveal customer PII." A judge can scan free text for shapes a regex would miss.
"The tool's output is internally consistent across fields." A judge can reason about cross-field constraints a JSON Schema cannot express.

The shared property is that no closed-form expression captures the test. Any regex, JSON Schema, or exact matcher written for these criteria ends up either too permissive or too strict.

When NOT to use an LLM evaluation

Stay with deterministic matchers for everything below. They are cheaper, faster, reproducible, and never wrong about a binary fact:

Structural shape: use schema. The response must validate against this JSON Schema.
Literal equality: use exact. The value must equal this string or object.
Substring or sub-object containment: use contains.
Pattern matching: use regex.
Snapshot drift: use snapshot.
Latency: use max_duration_ms on the expect block.
Token cost: use max_response_tokens.
Response headers (URL servers): use response_headers.

Deterministic checks are not "lower tier" than judge checks. They are the foundation a judge sits on top of. A test suite that gates a feature release on a single LLM judge call is one rate-limited model away from a red CI run that has nothing to do with the code under test.

A practical rule: every test that uses llm-judge should pair it with at least one deterministic check that the response was even shaped correctly. The deterministic check catches the boring failure modes (empty response, wrong tool, timeout) so the judge only has to read responses that already cleared the structural bar.

Single judge vs jury

A single judge is one model call per assertion. A jury is N model calls per assertion; the jury passes when the fraction of jurors that pass meets a quorum.
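
The quorum rule above can be sketched in a few lines. This is an illustration rather than the runner's implementation: the function name jury_passes is hypothetical, and the >= comparison is an assumption drawn from the word "meets".

```python
from typing import Sequence


def jury_passes(verdicts: Sequence[bool], quorum: float = 0.5) -> bool:
    """Return True when the fraction of passing jurors meets the quorum.

    Hypothetical helper: the >= comparison is an assumption, not
    confirmed runner behavior.
    """
    if not verdicts:
        raise ValueError("a jury needs at least one verdict")
    passing = sum(1 for v in verdicts if v)
    return passing / len(verdicts) >= quorum


# Three jurors, two pass: 2/3 meets the default quorum of 0.5.
print(jury_passes([True, True, False]))
# Same verdicts against a unanimous quorum: 2/3 falls short of 1.0.
print(jury_passes([True, True, False], quorum=1.0))
```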

Dimension	Single judge	Jury
Cost per assertion	1x	Nx (3 to 7 is typical)
Latency per assertion	1x	About 1x with parallel calls; up to Nx with sequential calls (the mode that allows early termination)
Variance across runs	Higher	Lower (consensus averages out outlier verdicts)
Calibration against a labeled set	Harder; one model's biases dominate	Easier; compare each juror to the labels and tune its per-juror threshold or drop it
Setup complexity	Pick a model, write a rubric	Pick N models, write a rubric, set a quorum
Resistance to a flaky vendor	None; a 500 fails the test	High; a failed juror is excluded if the survivor pool can still decide
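
To put a rough number on the variance row: if each juror independently returns the wrong verdict with probability p, the chance that a strict majority of n jurors is wrong follows from the binomial distribution. Independence is an assumption here, and a generous one, since jurors that share a vendor or training data tend to make correlated mistakes. The function name is illustrative.

```python
from math import comb


def majority_error(p: float, n: int) -> float:
    """Probability that more than half of n independent jurors are wrong,
    given that each one is wrong with probability p."""
    smallest_majority = n // 2 + 1
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(smallest_majority, n + 1)
    )


# A single judge that errs 10% of the time, versus juries of 3 and 5.
for n in (1, 3, 5):
    print(n, round(majority_error(0.10, n), 4))
```

Under these assumptions a 10% single-judge error rate drops to about 2.8% with three jurors and about 0.9% with five. Correlated jurors will do worse than this, which is one reason to mix vendors.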

Pick a single judge for nightly or weekly runs where one flaky verdict is acceptable. Pick a jury when:

A failed assertion blocks a release and you cannot afford a coin flip.
A regulated environment demands a documented evaluation strategy.
The criterion is genuinely contested (different reasonable judges disagree) and you want the disagreement surfaced in reporter output.

A useful starting jury is three models from different vendors at the default simple-majority quorum: 0.5. Two vendors degrading at once is rare, and three opinions are enough to break ties without paying for five.

Cost budgeting

A hard dollar ceiling lives on the dedicated mcptest eval command, which exists to run judge and jury evaluations with a budget:

mcptest eval --max-cost 5.00 suite.yml

--max-cost <USD>: hard ceiling in dollars for all judge calls in the evaluation. The runner tracks per-juror cost using each provider's published rates and aborts further calls when the budget is exhausted. Aborted assertions return a budget-exhausted failure with the running total attached. Use this in CI to keep a runaway loop from billing your organization. Accepted forms: --max-cost 5.00, --max-cost $5.00, --max-cost 5. Negative values are rejected at parse time.
--explain: print the per-juror reasoning and cost breakdown so you can see where the budget went.
--no-verdict-cache: skip the LLM-judge verdict cache for this run (see below). The cache, on by default, is the cheapest cost control there is.

mcptest run itself takes --no-verdict-cache but not --max-cost: a plain run gates pass/fail and is not the place for a spend ceiling. When you want to budget judge spend, reach for mcptest eval.

A reasonable budget per CI job:

Smoke job (every PR): mcptest eval --max-cost 1.00. Catches egregious regressions cheaply; the verdict cache absorbs unchanged assertions.
Nightly job: mcptest eval --max-cost 20.00. Runs every judge assertion against the full jury.
Pre-release job: mcptest eval --max-cost 50.00 plus a wider rubric set. Slow, expensive, gates the release.

The reporter prints the cost summary at the end of every run so you can trim or grow the budget per environment.

Prompt engineering tips

The rubric: field is a prompt the judge model sees verbatim. The same prompt-engineering hygiene that applies to a production agent applies here.

Be specific about pass and fail conditions. A rubric that says "answer should be reasonable" is worthless. "The answer must (1) reference the service by name, (2) mention the release tag, and (3) avoid hallucinated version numbers" is testable.
List positive and negative examples in the rubric body. Judges follow examples better than they follow abstractions. Two passing and two failing examples per rubric is usually enough.
Ask for structured output. Tell the judge to return a JSON object with pass: bool, score: float in [0, 1], and reason: string. The runner parses the result strictly; a judge that wanders off the format is counted as a failed juror.
Avoid loaded language. Words like "good," "best," and "obvious" pull every model toward a higher score. Use neutral, descriptive language.
Pin the judge's role in the rubric, not just the config. When the rubric says "you are evaluating answers from a customer-support agent against the criteria below," the judge stays on task more reliably than when it is fed the raw assertion.
Test the rubric before you ship the suite. Run the suite once with five known-good responses and five known-bad responses. If the judge rates them in the wrong order, rewrite the rubric.
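
The last tip amounts to a separation check on the judge's scores. A minimal sketch, assuming you have already collected scores for the known-good and known-bad responses and that a score at or above the threshold counts as a pass; the function name and the sample scores are made up for illustration and are not mcptest output.

```python
from typing import Sequence


def rubric_separates(good_scores: Sequence[float],
                     bad_scores: Sequence[float],
                     threshold: float) -> bool:
    """True when every known-good score clears the threshold and every
    known-bad score falls below it."""
    return min(good_scores) >= threshold and max(bad_scores) < threshold


# Made-up scores for five known-good and five known-bad responses.
good = [0.92, 0.88, 0.81, 0.95, 0.77]
bad = [0.10, 0.35, 0.42, 0.05, 0.72]

# One known-bad response scored 0.72, above the 0.7 threshold, so this
# rubric does not separate the two sets and needs another pass.
print(rubric_separates(good, bad, threshold=0.7))
```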

A workable rubric template:

You are a strict evaluator. Read the response below and decide whether it
satisfies the criteria.

Criteria:
1. <one-line criterion>
2. <one-line criterion>
3. <one-line criterion>

Examples of passing responses:
- <one-line example>
- <one-line example>

Examples of failing responses:
- <one-line example>
- <one-line example>

Return a JSON object: {"pass": true|false, "score": 0.0-1.0, "reason": "<one sentence>"}.
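
To make "parses the result strictly" concrete, here is one way a strict parser for the reply format in the template could look. It is a sketch under assumptions: the helper name parse_verdict and the exact rejection rules (no extra keys, score inside [0, 1]) are hypothetical, not the runner's documented behavior.

```python
import json
from typing import Optional, TypedDict


class Verdict(TypedDict):
    passed: bool
    score: float
    reason: str


def parse_verdict(raw: str) -> Optional[Verdict]:
    """Parse a judge reply strictly; return None for anything off-format.

    Hypothetical helper illustrating strict parsing, not the runner's code.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != {"pass", "score", "reason"}:
        return None
    passed, score, reason = data["pass"], data["score"], data["reason"]
    if not isinstance(passed, bool) or not isinstance(reason, str):
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    if not 0.0 <= score <= 1.0:
        return None
    return {"passed": passed, "score": float(score), "reason": reason}


# A well-formed reply parses; a chatty reply is rejected outright.
print(parse_verdict('{"pass": true, "score": 0.9, "reason": "Names the service and tag."}'))
print(parse_verdict('Sure! Here is my verdict: {"pass": true}'))
```

A caller would treat a None result as a failed juror, matching the behavior described in the structured-output tip.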

Choosing a quorum

There are no named consensus methods. The jury passes when the fraction of passing jurors meets quorum, so you pick the fraction that matches how contested the criterion is. The companion guide jury-consensus.md covers this in full with worked examples.

Policy	`quorum`	Use when
Simple majority	`0.5` (default)	The criterion is contested but a majority is enough.
Two of three	`0.67`	One dissenter is fine, two is a warning sign.
Unanimous	`1.0`	Hard safety or PII (personally identifiable information) properties: any reasonable juror calling it bad means bad.
k of n	`k / n`	Any explicit ratio. Four of five is `0.8`.

A pragmatic default: three jurors at the default quorum: 0.5. Raise the quorum toward 1.0 for assertions that touch safety or PII. To lean on a better-calibrated juror, give the weaker jurors a stricter per-juror threshold so they pass less readily.
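
When picking a quorum it helps to translate the fraction back into "how many jurors must pass" for your jury size. The sketch below does that by brute force, assuming the runner compares the passing fraction to quorum with a plain >= (the text says "meets" but does not spell out rounding); min_passing is a hypothetical name.

```python
def min_passing(n: int, quorum: float) -> int:
    """Smallest number of passing jurors whose fraction of n meets the
    quorum, assuming a plain >= comparison (an assumption, not confirmed
    runner behavior)."""
    if n < 1:
        raise ValueError("a jury needs at least one juror")
    for k in range(n + 1):
        if k / n >= quorum:
            return k
    raise ValueError("a quorum above 1.0 can never be met")


for n, quorum in [(3, 0.5), (3, 0.67), (3, 1.0), (5, 0.8), (4, 0.5)]:
    print(f"n={n} quorum={quorum}: needs {min_passing(n, quorum)} passing")
```

Two results are worth checking against the runner's actual behavior before relying on them. Under a plain >= comparison, 2/3 is about 0.6667 and falls just short of 0.67, so a three-juror jury at 0.67 would need all three to pass. And with an even-sized jury, 0.5 lets a tie pass.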

Putting it together: a balanced test

A test that uses an LLM judge well rests on a foundation of deterministic checks.

tools:
  - name: "summarizer mentions the service and the release tag"
    server: production
    tool: summarize_deploy
    args:
      service: "search-svc"
      tag: "v1.4.0"
    expect:
      assertions:
        # 1. Deterministic: the response is a string of at least 50 characters.
        - target: "result.content[0].text"
          matcher:
            schema:
              type: string
              minLength: 50

        # 2. Deterministic: it contains the literal tag.
        - target: "result.content[0].text"
          matcher:
            contains: "v1.4.0"

        # 3. Judge: it stays on topic and is well-written.
        - target: "result.content[0].text"
          matcher:
            llm-judge:
              rubric: |
                The summary must (1) mention the service name "search-svc",
                (2) reference the release tag v1.4.0, and (3) read as a
                deployment note rather than marketing copy. Return JSON
                {"pass": bool, "score": 0..1, "reason": "..."}.
              threshold: 0.7

      max_duration_ms: 5000
      max_response_tokens: 800

The deterministic checks fire first. If the response is too short or omits the tag, the test fails fast and the judge call never happens. That is the cheapest, fastest, most reproducible kind of failure: the one a deterministic matcher caught.

Cost controls

The Cost budgeting section above introduces mcptest eval --max-cost. This section pins down the settings that control judge spend and shows how they compose on a realistic test suite.

`--max-cost <USD>` (on `mcptest eval`)

Hard ceiling on cumulative LLM spend across the whole evaluation. The runner records every juror call against a cost ledger and refuses to dispatch the next call once the cap is reached. Aborted assertions return a budget-exhausted failure with the running total attached; an unset value leaves the budget unbounded.
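
A cost ledger of this kind can be modeled roughly as follows. This is a sketch of the idea, not the runner's code: the class and method names are hypothetical, and the rates and token counts are illustrative values in USD per million tokens.

```python
class BudgetExhausted(Exception):
    """Raised when a call is requested after the cap has been reached."""


class CostLedger:
    """Hypothetical ledger: records spend and gates further calls."""

    def __init__(self, max_cost_usd=None):
        self.max_cost_usd = max_cost_usd  # None leaves the budget unbounded
        self.total_usd = 0.0

    def check(self):
        """Refuse to dispatch once the running total has reached the cap."""
        if self.max_cost_usd is not None and self.total_usd >= self.max_cost_usd:
            raise BudgetExhausted(f"budget exhausted at ${self.total_usd:.4f}")

    def record(self, input_tokens, output_tokens, in_rate, out_rate):
        """Add one call's cost; rates are USD per million tokens."""
        cost = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
        self.total_usd += cost
        return cost


ledger = CostLedger(max_cost_usd=0.0005)
dispatched = 0
try:
    for _ in range(10):
        ledger.check()  # gate before every juror call
        ledger.record(800, 200, in_rate=0.15, out_rate=0.60)
        dispatched += 1
except BudgetExhausted as exc:
    print(dispatched, exc)
```

Because the gate runs before each call rather than predicting its cost, the final total in this model can exceed the cap by up to one call.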

Accepted forms on the CLI: --max-cost 5.00, --max-cost $5.00, --max-cost 5. Negative values are rejected at parse time. This flag lives on mcptest eval (and on mcptest pipe and mcptest tools call), not on mcptest run.
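
The accepted forms can be mirrored with a small parser. This is a hypothetical sketch for illustration (for example, to validate a budget value in your own CI scripts before passing it along), not the CLI's actual parsing code.

```python
from decimal import Decimal, InvalidOperation


def parse_max_cost(raw: str) -> Decimal:
    """Parse '5', '5.00', or '$5.00' into a non-negative Decimal.

    Hypothetical parser mirroring the accepted forms listed above.
    """
    text = raw.strip()
    if text.startswith("$"):
        text = text[1:]
    try:
        value = Decimal(text)
    except InvalidOperation:
        raise ValueError(f"not a dollar amount: {raw!r}") from None
    if not value.is_finite() or value < 0:
        raise ValueError(f"max cost must be a non-negative amount: {raw!r}")
    return value


for form in ("5.00", "$5.00", "5"):
    print(form, "->", parse_max_cost(form))
```

One practical caveat with the dollar form: in a POSIX shell an unquoted $5.00 is subject to parameter expansion ($5 is the fifth positional parameter), so single-quote it, as in --max-cost '$5.00', or use the bare number.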

`--no-verdict-cache`

Both mcptest run and mcptest eval cache LLM-judge verdicts keyed by the prompt, model, and inputs. The cache is on by default, so a rerun that changed nothing re-spends nothing; it is the cheapest cost control there is. Pass --no-verdict-cache to force fresh verdicts (for example when you are calibrating a rubric and want every run to re-judge).
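
To see why an unchanged rerun re-spends nothing, it can help to picture the cache key as a hash over the three things named above. The scheme below is purely illustrative: the real key format is not documented here, and verdict_cache_key is a hypothetical name.

```python
import hashlib
import json


def verdict_cache_key(prompt: str, model: str, inputs: dict) -> str:
    """Derive a stable key from the prompt, the model, and the inputs.

    Hypothetical scheme; the runner's real key format may differ.
    """
    payload = json.dumps(
        {"prompt": prompt, "model": model, "inputs": inputs},
        sort_keys=True,          # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = verdict_cache_key("rubric v1", "judge-model", {"text": "hello", "tag": "v1.4.0"})
b = verdict_cache_key("rubric v1", "judge-model", {"tag": "v1.4.0", "text": "hello"})
c = verdict_cache_key("rubric v2", "judge-model", {"text": "hello", "tag": "v1.4.0"})
print(a == b, a == c)
```

In this model any edit to the rubric, the model name, or the judged text produces a new key, so only changed assertions cost money on a rerun.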

Per-juror `max_tokens`

Every juror entry in the YAML config carries an optional max_tokens: u32 field. The runner threads this into the provider request as the output-token cap. Combined with --max-cost it puts a hard upper bound on a runaway generation: a juror that ignores the rubric and starts streaming filler still cannot overshoot the per-call budget.

Worked example

A weekly pre-release sweep on a 200-test suite with a jury of three GPT-4o-mini judges:

mcptest eval --max-cost 5.00 examples/pre-release/

Each juror call averages about 800 input + 200 output tokens at GPT-4o-mini rates ($0.15 per million input tokens, $0.60 per million output tokens). One assertion across three jurors costs (800 * 0.15 + 200 * 0.60) / 1_000_000 * 3 ~= $0.00072. With one judge assertion per test, 200 assertions come to roughly $0.144. The --max-cost 5.00 cap leaves a cushion of roughly 35x for retries and longer prompts. If the suite ever grows past that, the runner aborts early and the reporter shows the budget exhaustion in the run footer:

Summary: 184 passed, 16 failed, 0 skipped in 91342 ms
Cost: $5.0012 total, $0.00083/call avg, $0.02508/test avg (6024 model calls across 199 tests)
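
The estimate that precedes the footer can be reproduced with a few lines of arithmetic, which is handy when you swap in a different model or jury size. The rates and token counts are the ones quoted in the worked example; the variable and function names are illustrative.

```python
INPUT_RATE = 0.15   # USD per million input tokens (from the example above)
OUTPUT_RATE = 0.60  # USD per million output tokens (from the example above)
JURORS = 3
ASSERTIONS = 200
MAX_COST = 5.00


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one juror call at the rates above."""
    return (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1_000_000


per_assertion = call_cost(800, 200) * JURORS
suite_cost = per_assertion * ASSERTIONS
cushion = MAX_COST / suite_cost

print(f"per assertion: ${per_assertion:.5f}")
print(f"full suite:    ${suite_cost:.3f}")
print(f"cushion:       {cushion:.1f}x")
```

The cushion works out to about 34.7x, which the text rounds to 35x.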

For a cheap daily PR-smoke job, lean on the verdict cache (unchanged assertions re-spend nothing) and a tight ceiling:

mcptest eval --max-cost 0.50 examples/pre-release/