External scorers: grade evals with your own stack

mcptest ships a built-in LLM judge, but many teams already pay for an eval platform (Braintrust, Patronus, LangSmith, Humanloop, promptfoo, ...) with maintained evaluators, datasets, and dashboards. The exec scorer lets you plug that stack in instead of choosing between it and mcptest: you write a thin wrapper script, mcptest hands it each candidate response, and your wrapper returns a verdict. Reporters, JSON output, --max-cost, and verdict caching stay uniform no matter who scored.

This guide covers the contract, the config, the error modes, and a worked example. Reference wrappers ship in examples/scorers/.

The contract

An exec scorer is any executable that:

reads one JSON document from stdin, then sees EOF, and
writes one JSON verdict to stdout and exits 0.

Request (mcptest writes this to stdin)

{
  "input": "the prompt that produced the response",
  "output": "the candidate response under evaluation",
  "criteria": "the grading rubric, free-form text",
  "metadata": { "model": "gpt-4o", "eval": "summary-quality" }
}

metadata is an ordered object of scorer-specific hints and may be empty.

Verdict (your wrapper writes this to stdout)

{
  "verdict": "pass",
  "score": 0.92,
  "reasoning": "the summary names the service and the release tag",
  "cost_usd": 0.0003
}

verdict is "pass" or "fail" and is required.
score is an optional confidence in [0, 1].
reasoning is optional free text.
cost_usd is the dollar cost of the call, or null when your platform cannot report one. A null cost does not count toward --max-cost (mcptest warns about this at startup so the cap is not silently leaky).

Configuration

Scorers live in a top-level scorers: block. (YAML cannot put a scorers: mapping and the evals: sequence under one key, so scorers sit at the top level next to evals:.)

servers:
  main:
    command: ["./my-mcp-server"]

evals:
  - name: summary-quality
    server: main
    prompt: "Summarize the incident report."
    rubric: "Names the service and the release tag, under 50 words."

scorers:
  - name: braintrust-factuality
    type: exec
    command: ["python3", "examples/scorers/braintrust.py"]
    env:
      BRAINTRUST_API_KEY: "${BRAINTRUST_API_KEY}"
    cwd: "."
    timeout_ms: 30000

Field	Required	Default	Notes
`name`	yes	-	Folded into the verdict cache key, so renaming invalidates cached verdicts.
`type`	yes	-	`exec` for this guide. Open-vocabulary, but only `exec` resolves in this build.
`command`	yes (exec)	-	Argv; first element is the executable.
`env`	no	none	Overlay on the process env. `${VAR}` references resolve from mcptest's environment; an unset reference is a load error.
`cwd`	no	mcptest cwd	Working directory for the wrapper.
`timeout_ms`	no	`30000`	Per-call wall-clock timeout.

The type is open-vocabulary on purpose. The same config shape carries a future http scorer and the named vendor adapters (type: braintrust, type: patronus, ...) that ship in the commercial build. In the OSS build, an unknown type fails with a clear diagnostic rather than at parse time:

scorer type 'braintrust' not available in this build

Error modes

Every failure surfaces in every reporter (text, JSON, JUnit, GitHub Actions) as a structured scorer error, never a panic:

Non-zero exit: the wrapper exited with a failure code. mcptest includes the captured stderr in the diagnostic.
Malformed stdout: blank output, or output that is not a valid {verdict, ...} envelope, fails the test with a parse diagnostic.
Timeout: the wrapper exceeded timeout_ms.
Spawn failure: the executable was missing or not runnable (transport error).

A worked example

Run the bundled promptfoo-style wrapper, which scores offline so you can try it without an API key:

scorers:
  - name: promptfoo-keyword
    type: exec
    command: ["python3", "examples/scorers/promptfoo.py"]

mcptest eval --config mcptest.yaml

mcptest prints one line per (eval, scorer) outcome:

PASS [exec] summary-quality: cost=$0.0000 matched criteria keyword
Scorer cost: $0.0000

Swap the wrapper body for a real call to your platform's SDK. The reference wrappers in examples/scorers/ are 30-50 lines each and show exactly where the SDK call goes.

Verdict caching

When verdict caching is enabled, an exec scorer's output is cached on (framework version, scorer name, scorer version, request). A change to your wrapper that you signal by bumping the scorer name (for example judge-v2) invalidates only that scorer's cached verdicts. An mcptest framework upgrade that changes the scoring contract invalidates every cached verdict automatically, because the framework version is folded into the key.

What lives in the commercial build

The OSS exec primitive already covers every vendor through a wrapper script. The commercial build adds named, maintained adapters (type: braintrust, etc.) with managed UX, bidirectional dataset and dashboard sync, centralized eval history with PR diffing, and BYO-KMS for third-party scorer secrets. The config shape and the verdict envelope are identical, so moving from a wrapper to a managed adapter is a one-line type: swap.