Judge calibration

A jury reports a verdict and a confidence band: high, medium, or low. The band tells you how much the jurors agreed with each other. It leaves open two questions you often need answered before you let a judge gate a pull request:

Does the judge's stated confidence actually track how often it is right? A judge that says "95% sure" and is correct half the time is overconfident, and the band alone will not catch that.
If the judge is imperfect, can you still trust the rate it reports? A handful of human-labeled examples can turn a noisy judge into a statistically valid gate.

The calibration module in mcptest-core answers both with pure, deterministic statistics. No model calls, no network. You hand it labeled verdicts (judge confidence plus the trusted ground-truth outcome) and it hands back numbers you can act on. The walkthrough below shows the whole loop end to end; the formulas and the routing rule come after it. The grounding papers are in research-references.md.

How it works, end to end

Calibration does not run your judge. It scores a file of past verdicts that you labeled by hand, and fails the suite when the judge's stated confidence does not line up with how often it was actually right. Four steps:

You already run a judge. Somewhere in your suite an llm-judge or llm-jury matcher scores responses, and each verdict carries a confidence in [0, 1].
Label a sample once, by hand. Take 20 to 50 cases whose correct answer you already know. Run your judge over them and, for each, record two things: the confidence the judge reported, and whether its verdict was actually right. Write one row per case into a labels file:
```
{"confidence": 0.95, "correct": true}
{"confidence": 0.55, "correct": true}
{"confidence": 0.52, "correct": false}
{"confidence": 0.10, "correct": false}
```
Nothing generates this for you. It is your ground truth, so you write it.
Add a calibration: block that points at the file:
```
calibration:
  - name: "judge stays calibrated"
    labels: ./calibration-labels.jsonl
```
With no expect:, the default gate is ece <= 0.10 and brier <= 0.25 (both explained below; lower is better, and 0 is perfect).
Run the suite. The engine reads the file, computes how far the judge's confidence drifts from its real accuracy, and passes or fails the entry like any other test:
```
$ mcptest run --config calibration.yml --reporter minimal
ran 2 tests: 2 passed, 0 failed
```

A worked example. The shipped examples/calibration-labels.jsonl has 12 rows. Across them the judge claims an average confidence of 0.57 and is right on 7 of the 12 (0.58), so what it says lines up with how it does: ECE works out to 0.094 and Brier to 0.050, both under the default gates, so the check passes. Now flip that around: for a judge that says 0.95 on cases it gets right only half the time, ECE climbs past 0.10 and the check fails. That gap, between how sure the judge sounds and how often it is right, is the regression this catches before you let the judge gate a pull request.

The judge or jury that produces the verdicts

Step 1 above assumes you already have a judge or jury in your suite. If you do not yet, here are complete suites to copy from. The judge runs only after the deterministic checks pass, so a structurally broken response never spends a model call (examples/llm-judge-basic/):

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json
servers:
  production:
    url: "https://mcp.example.com/v1"
    auth:
      bearer_token_env: "MCPTEST_PRODUCTION_TOKEN"

tools:
  - name: "summary reads as a deployment note"
    server: production
    tool: "summarize_deploy"
    args:
      service: "search-svc"
      tag: "v1.4.0"
    timeout_ms: 8000
    expect:
      assertions:
        - target: "result.content[0].text"
          matcher:
            llm-judge:
              rubric: |
                Pass if the summary names the service and the release tag and
                reads as an internal deployment note, not marketing copy.
                Return {"pass": bool, "score": 0.0-1.0, "reason": "<one sentence>"}.
              threshold: 0.7
              model: "claude-opus-4-7"
          message: "summary should read as a deployment note"

A jury is the same suite with the matcher swapped: several jurors grade independently, and the jury reports a verdict plus a confidence band. That band is the input to the routing rule below (examples/llm-jury-consensus/):

        - target: "result.content[0].text"
          matcher:
            llm-jury:
              rubric: |
                Pass if the summary names the service and the release tag and
                reads as an internal deployment note, not marketing copy.
                Return {"pass": bool, "score": 0.0-1.0, "reason": "<one sentence>"}.
              threshold: 0.7
              quorum: 0.67
              jurors:
                - model: "claude-opus-4-7"
                - model: "gpt-4o"
                - model: "gemini-2.5-pro"
          message: "summary should read as a deployment note across the jury"

Run the examples. Both files validate and run as-is:

mcptest run --config examples/llm-judge-basic/tests/judge-basic.yml
mcptest run --config examples/llm-jury-consensus/tests/jury-consensus.yml

Each jury assertion reports a confidence band (high, medium, or low) and an escalate_for_review flag next to the pass or fail. See jury consensus for that output shape. The rest of this page is what you do with those verdicts: route the low-confidence ones, and (with a labeled set) check whether the stated confidence can be trusted at all.

Reach for calibration when you have labeled verdicts to score. Each labeled verdict pairs the judge's stated confidence in [0, 1] with correct: true|false, which records whether that verdict matched a trusted label. You produce the labels file by running your judge against a hand-labeled set and recording both values per item.

The `calibration:` block

A top-level calibration: list gates a judge on its calibration. Each entry requires a name and a labels file; reliability, observed_positive_rate, and expect are optional. Each entry exposes its computed metrics as assertion targets that you gate with the standard matchers:

calibration:
  - name: "judge stays calibrated"
    labels: ./calibration-labels.jsonl   # one {confidence, correct} per line
    # optional bias-correction inputs; enable the corrected_rate target:
    reliability: { tp: 90, fn: 10, tn: 80, fp: 20 }
    observed_positive_rate: 0.50
    expect:                          # optional; defaults apply if omitted
      - target: ece
        matcher: { schema: { maximum: 0.10 } }
      - target: brier
        matcher: { schema: { maximum: 0.25 } }
      - target: corrected_rate
        matcher: { schema: { maximum: 0.50 } }

The targets:

ece: Expected Calibration Error, a number in [0, 1], lower is better.
brier: the Brier score, a number in [0, 1], lower is better.
corrected_rate: the Rogan-Gladen bias-corrected positive rate. Present only when both reliability and observed_positive_rate are given.
corrected_rate_low / corrected_rate_high: the Wald 95% confidence interval around corrected_rate. Present under the same gate as corrected_rate (both reliability and observed_positive_rate given).

The labels file

Each row is one item you judged: confidence is the score the judge reported for its own verdict, and correct is true when that verdict matched your trusted label and false when it did not. You build the file once, by hand: take a set of cases whose right answer you already know, run your judge over them, and write one row per case. A real calibration-labels.jsonl:

{"confidence": 0.95, "correct": true}
{"confidence": 0.90, "correct": true}
{"confidence": 0.82, "correct": true}
{"confidence": 0.55, "correct": true}
{"confidence": 0.52, "correct": false}
{"confidence": 0.15, "correct": false}
{"confidence": 0.10, "correct": false}
{"confidence": 0.05, "correct": false}

Read that top to bottom: this judge is mostly well sorted (confident when it was right, unsure when it was wrong), except the 0.52 -> false row, where it was more than half sure and still wrong. ECE and Brier fold that whole column into one number you gate on, so you catch a judge whose stated confidence has drifted away from how often it is actually right.

The file is JSONL (one object per line, as above) or a YAML array of the same objects. A missing or unparseable file is a load error. An empty labels set is not an error (ECE is 0 by definition) but the check emits a warning so it does not read as a silent pass. The worked file ships at examples/calibration-labels.jsonl.
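If your judge results already sit in a script, the rows can be produced mechanically once the ground truth is known. The following is a minimal sketch, not part of mcptest: the function name to_label_rows and the (confidence, judge_passed, should_pass) tuple shape are assumptions made for illustration. Only the output row shape, {confidence, correct}, comes from this page.

```python
import json


def to_label_rows(results):
    """Turn hand-checked judge results into JSONL label rows.

    `results` is a hypothetical shape: (confidence, judge_passed, should_pass),
    where should_pass is the answer you labeled by hand.
    """
    rows = []
    for confidence, judge_passed, should_pass in results:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range: {confidence}")
        # correct is true when the judge's verdict matched the trusted label.
        record = {"confidence": confidence, "correct": judge_passed == should_pass}
        rows.append(json.dumps(record))
    return rows


if __name__ == "__main__":
    sample = [(0.95, True, True), (0.52, True, False), (0.10, False, False)]
    with open("calibration-labels.jsonl", "w", encoding="utf-8") as fh:
        fh.write("\n".join(to_label_rows(sample)) + "\n")
```

Note that correct records whether the verdict matched the label, not whether the item passed: a confident, correct fail is still correct: true.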

Omit expect: and the engine applies sane defaults: ece <= 0.10, brier <= 0.25, and, when the corrected-rate inputs are present, corrected_rate <= observed_positive_rate (no worse than the raw rate). Each entry emits one PASS/FAIL row on the run report, and a failing entry exits non-zero like any test.
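The default gate can be read as three comparisons. The sketch below restates it in Python so the thresholds are easy to check by hand; the function name default_gate and its return shape are assumptions for illustration, not the engine's API.

```python
def default_gate(ece, brier, corrected_rate=None, observed_positive_rate=None):
    """Return a per-target pass/fail map for the documented default thresholds."""
    checks = {
        "ece": ece <= 0.10,
        "brier": brier <= 0.25,
    }
    # The corrected-rate check applies only when both inputs are present.
    if corrected_rate is not None and observed_positive_rate is not None:
        checks["corrected_rate"] = corrected_rate <= observed_positive_rate
    return checks


if __name__ == "__main__":
    # Metrics quoted for the shipped 12-row example on this page.
    print(default_gate(ece=0.094, brier=0.050))
```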

The worked suite is examples/calibration.yml.

Per-verdict calibration

Two metrics measure how well stated confidence matches observed correctness. For both, lower is better.

Expected Calibration Error (ECE)

The ece target is the headline metric. The engine sorts the judge's stated confidences into ten equal-width bins over [0, 1]. For each populated bin it compares the mean confidence in that bin to the actual accuracy in that bin, then averages those gaps weighted by how many samples fell in each bin:

ECE = sum_b (n_b / N) * | mean_confidence_b - accuracy_b |

A perfectly calibrated judge scores 0.0: in every bin, the confidence it claimed equals the rate at which it was right. An overconfident judge that says 1.0 on a set it gets right only half the time scores 0.5: the single populated bin has mean confidence 1.0 against accuracy 0.5. An empty set scores 0.0 because there is no claim to disagree with.
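The formula can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the engine's code: the function name is hypothetical, and the bin-edge convention (bin b covers [b/10, (b+1)/10), with 1.0 placed in the last bin) is a guess that the page does not specify. A different edge convention can move a boundary value such as 0.10 into the neighboring bin and shift the result slightly.

```python
def expected_calibration_error(labels, n_bins=10):
    """ECE over equal-width bins. `labels` is a list of (confidence, correct)."""
    if not labels:
        return 0.0  # empty set: no claim to disagree with

    bins = [[] for _ in range(n_bins)]
    for confidence, correct in labels:
        # Assumed convention: half-open bins, with 1.0 folded into the last one.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))

    total = len(labels)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # only populated bins contribute
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_confidence - accuracy)
    return ece


if __name__ == "__main__":
    # Says 1.0 every time, right half the time.
    print(expected_calibration_error([(1.0, True), (1.0, False)]))
```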

This is the standard Expected Calibration Error, motivated by Overconfidence in LLM-as-a-Judge (arXiv:2508.06225), which documents that judge confidence routinely overstates accuracy, so confidence has to be validated against labels rather than trusted directly.

Brier score

The brier target is the mean squared error between stated confidence and the 0/1 outcome:

Brier = (1/N) * sum ( confidence - outcome )^2

It ranges over [0.0, 1.0], and lower is better. Unlike ECE it rewards sharpness as well as calibration: a judge that always says 0.5 scores 0.25 no matter the outcome, while a confident and correct judge scores near 0.0. Report it alongside ECE when you want a single number that punishes hedging.
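A minimal Python sketch of the same formula follows. The function name is hypothetical, and returning 0.0 for an empty set is an assumption made to mirror the ECE behavior; this page does not say what the engine does for Brier in that case.

```python
def brier_score(labels):
    """Mean squared error between confidence and the 0/1 outcome.

    `labels` is a list of (confidence, correct) pairs.
    """
    if not labels:
        return 0.0  # assumption: mirror the empty-set ECE behavior
    squared_errors = (
        (confidence - (1.0 if correct else 0.0)) ** 2
        for confidence, correct in labels
    )
    return sum(squared_errors) / len(labels)


if __name__ == "__main__":
    hedger = [(0.5, True), (0.5, False), (0.5, True)]
    print(brier_score(hedger))  # 0.25: always saying 0.5 costs 0.25 per row
```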

Small-trusted-set validation

Hand-label a small trusted set, run the judge against it, and you can measure how the judge errs, then correct the rate it reports on the much larger unlabeled set. This is the corrected_rate target: supply a reliability: confusion matrix and an observed_positive_rate: on the calibration: entry and the engine exposes the bias-corrected rate to assert on.

Measuring reliability

The reliability: block is a confusion matrix: four counts from a trusted set where you know the right answer, comparing what the judge said to the truth.

tp (true positives): the judge passed it and it should pass.
fn (false negatives): the judge failed it but it should have passed.
tn (true negatives): the judge failed it and it should fail.
fp (false positives): the judge passed it but it should have failed.

So reliability: { tp: 90, fn: 10, tn: 80, fp: 20 } means: of 100 items that should pass, the judge caught 90 and missed 10; of 100 that should fail, it correctly failed 80 and wrongly passed 20. The engine turns those four counts into the judge's sensitivity and specificity:

sensitivity = tp / (tp + fn), the true-positive rate (how often the judge passes things that should pass).
specificity = tn / (tn + fp), the true-negative rate (how often the judge fails things that should fail).

Each rate guards its own divide-by-zero. A trusted set with no positive-truth items yields sensitivity 0.0 rather than a panic, and likewise for specificity.
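The two rates and their divide-by-zero guards can be sketched as follows. The function names are hypothetical; the formulas and the 0.0 fallback are the ones described above.

```python
def sensitivity(tp, fn):
    """True-positive rate; 0.0 when the trusted set has no positive-truth items."""
    positives = tp + fn
    return tp / positives if positives else 0.0


def specificity(tn, fp):
    """True-negative rate; 0.0 when the trusted set has no negative-truth items."""
    negatives = tn + fp
    return tn / negatives if negatives else 0.0


if __name__ == "__main__":
    # reliability: { tp: 90, fn: 10, tn: 80, fp: 20 }
    print(sensitivity(90, 10), specificity(80, 20))  # 0.9 0.8
```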

Rogan-Gladen bias correction

The corrected_rate target recovers the true rate of positives (the share of items that really should pass, not to be confused with sensitivity) from the rate the imperfect judge actually reported (observed_positive_rate):

true = (observed + specificity - 1) / (sensitivity + specificity - 1)

The result is clamped to [0, 1]. Worked example: a judge with sensitivity 0.9 and specificity 0.8 that passes 50% of items gives

(0.5 + 0.8 - 1) / (0.9 + 0.8 - 1) = 0.3 / 0.7 = 0.4286

so the corrected positive rate is about 0.43, below the 0.50 the judge reported.

The denominator is Youden's J (sensitivity + specificity - 1). It is the amount of real signal the judge carries. When J is non-positive the correction is skipped and the observed rate is returned unchanged. An uninformative coin-flip judge (sensitivity 0.5, specificity 0.5) has J = 0, so corrected_rate equals the observed rate: with no signal to invert, correcting would only amplify noise (or divide by zero). The same holds for an anti-correlated judge where J would go negative.
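Put together, the correction, the clamp, and the Youden's J guard look roughly like this in Python. It is a sketch of the rule as described here, with a hypothetical function name, and should not be read as the engine's implementation.

```python
def rogan_gladen(observed, sensitivity, specificity):
    """Bias-correct an observed positive rate for an imperfect judge."""
    youden_j = sensitivity + specificity - 1.0
    if youden_j <= 0.0:
        # No usable signal (coin-flip or anti-correlated judge): skip the
        # correction and return the observed rate unchanged.
        return observed
    corrected = (observed + specificity - 1.0) / youden_j
    return min(1.0, max(0.0, corrected))  # clamp to [0, 1]


if __name__ == "__main__":
    print(round(rogan_gladen(0.5, 0.9, 0.8), 4))  # 0.4286
    print(rogan_gladen(0.5, 0.5, 0.5))            # 0.5, J = 0 so no correction
```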

This estimator and its uncertainty are the subject of Bias + Uncertainty (arXiv:2605.06939) and Noisy but Valid (arXiv:2601.20913), which make the case that a small trusted set turns an imperfect judge into a statistically valid gate.

The confidence interval

A point estimate alone hides how much you should trust it. A corrected rate of 0.43 computed from twenty labels is a much shakier number than the same 0.43 from a thousand. The corrected_rate_low and corrected_rate_high targets are the Wald 95% confidence interval around corrected_rate, so you can gate on the band rather than the point:

expect:
  # The whole plausible range should sit under the bar, not just the
  # point estimate.
  - target: corrected_rate_high
    matcher: { schema: { maximum: 0.60 } }

The interval treats the observed positive rate as a binomial proportion over the trusted set, builds the Wald band on it, then maps both endpoints through the same Rogan-Gladen correction. The trusted-set size is the total of the reliability: counts (tp + fn + tn + fp), so the band narrows as you label more. The one special case is an empty set, which gives a zero-width band: with no samples there is no uncertainty estimate. Both endpoints are clamped to [0, 1], and the engine orders them so corrected_rate_low is always the smaller, even for an anti-correlated judge where the correction inverts.

The Wald form is the simple, documented choice; it is known to be loose near 0 and 1, which is acceptable for a small-trusted-set sanity band. The grounding is Noisy but Valid (arXiv:2601.20913).
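The interval construction can be sketched in Python as below. Several details are assumptions rather than confirmed engine behavior: the function name, the critical value z = 1.96 for the 95% level, and clamping after (rather than before) the correction. The sketch follows the description above otherwise: a Wald band on the observed rate, both endpoints mapped through Rogan-Gladen, then clamped and ordered.

```python
import math


def corrected_rate_interval(observed, tp, fn, tn, fp, z=1.96):
    """Return (low, high) for the bias-corrected rate. z = 1.96 is assumed."""
    n = tp + fn + tn + fp
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    youden_j = sens + spec - 1.0

    def correct(rate):
        # Same rule as the point estimate: skip when there is no signal.
        if youden_j <= 0.0:
            return rate
        return (rate + spec - 1.0) / youden_j

    # Wald half-width on a binomial proportion; zero for an empty trusted set.
    half = z * math.sqrt(observed * (1.0 - observed) / n) if n else 0.0
    ends = [correct(observed - half), correct(observed + half)]
    ends = [min(1.0, max(0.0, e)) for e in ends]
    return min(ends), max(ends)


if __name__ == "__main__":
    low, high = corrected_rate_interval(0.5, tp=90, fn=10, tn=80, fp=20)
    print(round(low, 3), round(high, 3))  # roughly 0.33 and 0.528
```

With the 200-label reliability block from the example, the band spans about 0.33 to 0.53 around the 0.43 point estimate, so a gate of corrected_rate_high <= 0.60 would pass under these assumptions.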

Routing low-confidence verdicts

The accept-or-escalate decision is a band check: escalate a low confidence band, accept high or medium. You write that rule directly as an assertion on the jury that runs in the test, with no extra YAML:

- target: "result.content[0].text"
  matcher:
    llm-jury:
      rubric: "Pass when the explanation is correct and clear."
      jurors:
        - { model: claude-sonnet-4-5 }
        - { model: gpt-5 }
        - { model: gemini-2.5-pro }
      quorum: 0.66
# Accept only a non-low band; a low band is the escalate case.
- target: jury.confidence
  matcher: { not: { exact: low } }
# Or read the escalation flag straight off the jury:
- target: jury.escalate
  matcher: { exact: false }

The jury.confidence target is the same confidence band the jury reports, and jury.escalate is the same escalate_for_review flag, so the accept-or-escalate decision can never disagree with the band the reporter already shows. A low band is exactly the case the jury sets escalate_for_review: true for. See jury consensus for the full list of jury.* targets.
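If you post-process verdicts outside the suite, the same band check is a one-line rule. The sketch below is illustrative only: the verdict dictionaries and the function name route are hypothetical shapes, not the engine's report format. Only the rule itself (escalate low, accept high or medium) comes from this page.

```python
def route(verdicts):
    """Split verdicts into (accepted, escalated) by confidence band."""
    accepted, escalated = [], []
    for verdict in verdicts:
        # A low band is the escalate case; high and medium are accepted.
        if verdict["confidence"] == "low":
            escalated.append(verdict)
        else:
            accepted.append(verdict)
    return accepted, escalated


if __name__ == "__main__":
    sample = [
        {"name": "case-a", "confidence": "high"},
        {"name": "case-b", "confidence": "low"},
        {"name": "case-c", "confidence": "medium"},
    ]
    accepted, escalated = route(sample)
    print([v["name"] for v in escalated])  # ['case-b']
```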

Where an escalated verdict goes (a larger juror, or a human review queue) is the caller's concern. The review queue itself is out of scope: the OSS engine produces only the accept-or-escalate decision you gate on here.