
External scorers: grade evals with your own stack

mcptest ships a built-in LLM judge, but many teams already pay for an eval platform (Braintrust, Patronus, LangSmith, Humanloop, promptfoo, ...) with maintained evaluators, datasets, and dashboards. The exec scorer lets you plug that stack in instead of choosing between it and mcptest: you write a thin wrapper script, mcptest hands it each candidate response, and your wrapper returns a verdict. Reporters, JSON output, --max-cost, and verdict caching stay uniform no matter who scored.

This guide covers the contract, the config, the error modes, and a worked example. Reference wrappers ship in examples/scorers/.

The contract

An exec scorer is any executable that:

  1. reads one JSON document from stdin, then sees EOF, and
  2. writes one JSON verdict to stdout and exits 0.

Request (mcptest writes this to stdin)

{
  "input": "the prompt that produced the response",
  "output": "the candidate response under evaluation",
  "criteria": "the grading rubric, free-form text",
  "metadata": { "model": "gpt-4o", "eval": "summary-quality" }
}

metadata is an ordered object of scorer-specific hints and may be empty.

Verdict (your wrapper writes this to stdout)

{
  "verdict": "pass",
  "score": 0.92,
  "reasoning": "the summary names the service and the release tag",
  "cost_usd": 0.0003
}
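
As a rough illustration of the contract, a wrapper can be as small as the sketch below. The keyword-overlap rule is a placeholder invented for this example, not what any bundled wrapper does, and the "fail" verdict value is an assumption (this guide only shows "pass"). In a real wrapper, score() would call your platform's SDK.

```python
import json
import sys


def score(request: dict) -> dict:
    """Toy grader (hypothetical rule): pass when at least half of the
    longer words in the criteria also appear in the candidate output."""
    output_words = set(request["output"].lower().split())
    keywords = {w for w in request["criteria"].lower().split() if len(w) > 3}
    matched = keywords & output_words
    ratio = len(matched) / len(keywords) if keywords else 0.0
    return {
        # "fail" is an assumed counterpart to the documented "pass".
        "verdict": "pass" if ratio >= 0.5 else "fail",
        "score": round(ratio, 2),
        "reasoning": f"matched {len(matched)} of {len(keywords)} criteria keywords",
        "cost_usd": 0.0,  # nothing billable happens in this offline sketch
    }


def main(stdin=sys.stdin, stdout=sys.stdout) -> None:
    # read() returns once the caller closes stdin (EOF), per the contract.
    request = json.loads(stdin.read())
    json.dump(score(request), stdout)
    stdout.write("\n")
```

To use it as a wrapper, save it as a script and call main() at the end of the file. A normal return leaves the exit status at 0.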

Configuration

Scorers live in a top-level scorers: block. (A YAML key holds either a sequence or a mapping, not both, so the scorers: block cannot nest under the evals: sequence and sits at the top level next to it.)

servers:
  main:
    command: ["./my-mcp-server"]

evals:
  - name: summary-quality
    server: main
    prompt: "Summarize the incident report."
    rubric: "Names the service and the release tag, under 50 words."

scorers:
  - name: braintrust-factuality
    type: exec
    command: ["python3", "examples/scorers/braintrust.py"]
    env:
      BRAINTRUST_API_KEY: "${BRAINTRUST_API_KEY}"
    cwd: "."
    timeout_ms: 30000

Field       Required    Default      Notes
name        yes         -            Folded into the verdict cache key, so renaming invalidates cached verdicts.
type        yes         -            exec for this guide. Open-vocabulary, but only exec resolves in this build.
command     yes (exec)  -            Argv; first element is the executable.
env         no          none         Overlay on the process env. ${VAR} references resolve from mcptest's environment; an unset reference is a load error.
cwd         no          mcptest cwd  Working directory for the wrapper.
timeout_ms  no          30000        Per-call wall-clock timeout.
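
The env semantics (overlay on the process environment, ${VAR} resolved from mcptest's own environment, unset reference is a load error) can be pictured with the sketch below. It illustrates the documented behavior and is not mcptest's implementation; resolve_env and ConfigLoadError are hypothetical names.

```python
import re

# Matches ${NAME} references such as ${BRAINTRUST_API_KEY}.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


class ConfigLoadError(Exception):
    """Hypothetical error type standing in for a config load failure."""


def resolve_env(overlay: dict, parent: dict) -> dict:
    """Return the wrapper's environment: parent env plus the resolved overlay."""

    def substitute(match):
        name = match.group(1)
        if name not in parent:
            # An unset reference is an error at load time, not an empty string.
            raise ConfigLoadError(f"env reference ${{{name}}} is not set")
        return parent[name]

    child = dict(parent)  # start from the process env
    for key, value in overlay.items():
        child[key] = _VAR.sub(substitute, value)  # overlay entries win
    return child
```

Failing at load time means a missing API key shows up before any eval runs, rather than as an authentication error inside the wrapper.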

The type is open-vocabulary on purpose. The same config shape carries a future http scorer and the named vendor adapters (type: braintrust, type: patronus, ...) that ship in the commercial build. In the OSS build, an unknown type is not rejected at parse time; it fails when the scorer is resolved, with a clear diagnostic:

scorer type 'braintrust' not available in this build

Error modes

Every failure surfaces in every reporter (text, JSON, JUnit, GitHub Actions) as a structured scorer error, never a panic:

A worked example

Run the bundled promptfoo-style wrapper, which scores offline so you can try it without an API key:

scorers:
  - name: promptfoo-keyword
    type: exec
    command: ["python3", "examples/scorers/promptfoo.py"]

mcptest eval --config mcptest.yaml

mcptest prints one line per (eval, scorer) outcome:

PASS [exec] summary-quality: cost=$0.0000 matched criteria keyword
Scorer cost: $0.0000

Swap the wrapper body for a real call to your platform's SDK. The reference wrappers in examples/scorers/ are 30-50 lines each and show exactly where the SDK call goes.
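
While developing a wrapper, it can help to exercise it by hand before wiring it into mcptest. The harness below is a sketch of the stdin/stdout contract described in this guide, not mcptest's own runner; run_scorer is a hypothetical helper, and treating a nonzero exit as an error is an assumption drawn from the "exits 0" rule.

```python
import json
import subprocess


def run_scorer(command: list, request: dict, timeout_ms: int = 30000) -> dict:
    """Send one request to a wrapper and return its parsed verdict."""
    proc = subprocess.run(
        command,
        input=json.dumps(request),  # written to stdin, then stdin is closed (EOF)
        capture_output=True,
        text=True,
        timeout=timeout_ms / 1000,  # raises subprocess.TimeoutExpired when exceeded
    )
    if proc.returncode != 0:
        raise RuntimeError(f"scorer exited {proc.returncode}: {proc.stderr.strip()}")
    return json.loads(proc.stdout)  # raises ValueError if stdout is not valid JSON
```

For example, run_scorer(["python3", "examples/scorers/promptfoo.py"], request) with a request shaped like the one in the contract section should return a dict with verdict, score, reasoning, and cost_usd.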

Verdict caching

When verdict caching is enabled, an exec scorer's output is cached under a key built from (framework version, scorer name, scorer version, request). To signal a change to your wrapper, bump the scorer name (for example, judge-v2); that invalidates only that scorer's cached verdicts. An mcptest framework upgrade that changes the scoring contract invalidates every cached verdict automatically, because the framework version is folded into the key.
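
One way to picture the cache key is a hash over the four components. The sketch below is an assumption for illustration only: this guide does not specify the hash function or the serialization mcptest uses, and verdict_cache_key is a hypothetical name.

```python
import hashlib
import json


def verdict_cache_key(framework_version: str, scorer_name: str,
                      scorer_version: str, request: dict) -> str:
    """Derive a stable key from the four components named above."""
    # Key order is preserved (no sort_keys) because metadata is described
    # as an ordered object; compact separators keep the encoding stable.
    payload = json.dumps(
        [framework_version, scorer_name, scorer_version, request],
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the scorer name is part of the hashed payload, renaming judge to judge-v2 produces different keys for that scorer only, while a new framework version changes every key.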

What lives in the commercial build

The OSS exec primitive already covers every vendor through a wrapper script. The commercial build adds named, maintained adapters (type: braintrust, etc.) with managed UX, bidirectional dataset and dashboard sync, centralized eval history with PR diffing, and BYO-KMS for third-party scorer secrets. The config shape and the verdict envelope are identical, so moving from a wrapper to a managed adapter is a one-line type: swap.