Offline trace validation

Once you have recorded an agent run into a cassette, you usually do not want to call the model again on every CI run. The model is slow, costs money, and gives a slightly different answer each time. What you actually want to check is cheaper and more stable: did the agent call the right functions, with the right argument shapes, in the right order? That check runs entirely against the recorded trace, with no model in the loop.

Run this example. examples/offline-trace-validation.yml pins an agent's tool-call plan (name, argument shape, count). Record it once, then re-check it offline against the cassette with no model in the loop.

ANTHROPIC_API_KEY=... mcptest run --record --config examples/offline-trace-validation.yml
mcptest run --config examples/offline-trace-validation.yml

The validator is the pure function mcptest_core::eval::trace_validation::validate_trace. It takes the tool_calls array from a recorded trace (or cassette) and an expected call structure, and returns a pass/fail result with per-position diffs. Two additions extend the original validator: a fuller trajectory match vocabulary (the match modes described below) and golden-path efficiency scoring, where score_golden_path rates the same trajectory for efficiency.

Where the idea comes from

The pattern is borrowed from the Berkeley Function-Calling Leaderboard (BFCL, ICML 2025). BFCL evaluates a model's function calling in two ways. One way executes the calls and checks the results. The other, which BFCL calls AST (abstract syntax tree) checking, never executes anything: it parses the model's emitted call into a structure (function name plus arguments) and matches that structure against an expected call. AST checking is deterministic, free, and does not need a live backend, which makes it the right fit for a CI gate over a recorded cassette.

mcptest applies the same idea to a recorded MCP agent trace. The trace already stores each call as a structured object (name, server, args), so there is nothing to parse. The validator matches that recorded structure against an expected structure you write by hand.

What you assert

An expected trace is an ordered list of expected calls plus a match mode. Each expected call pins a function name and an argument shape:

name: must be exactly equal to the recorded tool_calls[i].name (the tool name without the <server>__ wire prefix).
args: one of five shapes. The shapes that inspect the arguments reuse existing matching logic, so the semantics match what you already know from the YAML matchers.

Args shape	Meaning	Backed by
`any`	Arguments may be anything, including absent. Pins the name only.	nothing
`ignore`	Skip argument checking for this call. Same matching effect as `any`, named to read as a deliberate "do not look at the args here" for noisy or nondeterministic arguments.	nothing
`exact`	Recorded args must deep-equal this value.	deep equality
`subset`	Recorded args must be a superset of this value (objects match as subsets, arrays as multisets). Extra recorded keys are fine.	the `contains` matcher
`schema`	Recorded args must validate against this JSON Schema document.	the `schema` matcher

Reusing contains and schema means a malformed JSON Schema surfaces as an error (not a silent pass), and the per-call diffs are the same AssertionDiff shape every other matcher and reporter already speaks.
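To make the `subset` shape concrete, here is a minimal, self-contained sketch of an object-subset check with multiset arrays. It is not the `contains` matcher's implementation: the `Json` enum, `obj` helper, and `is_subset` function are hypothetical stand-ins written for this page, and the greedy array pairing is an assumption about what "multiset" means here.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for a JSON value; the real validator works on recorded JSON.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Num(f64),
    Str(String),
    Arr(Vec<Json>),
    Obj(BTreeMap<String, Json>),
}

// Convenience constructor for object values (illustration only).
fn obj(pairs: &[(&str, Json)]) -> Json {
    Json::Obj(pairs.iter().map(|(k, v)| (k.to_string(), v.clone())).collect())
}

// True when `recorded` contains everything `expected` asks for.
fn is_subset(expected: &Json, recorded: &Json) -> bool {
    match (expected, recorded) {
        // Objects: every expected key must exist and match; extra recorded keys are fine.
        (Json::Obj(exp), Json::Obj(rec)) => exp
            .iter()
            .all(|(k, v)| rec.get(k).map_or(false, |r| is_subset(v, r))),
        // Arrays: each expected element claims a distinct recorded element (greedy pairing).
        (Json::Arr(exp), Json::Arr(rec)) => {
            let mut used = vec![false; rec.len()];
            exp.iter().all(|e| {
                match (0..rec.len()).find(|&i| !used[i] && is_subset(e, &rec[i])) {
                    Some(i) => {
                        used[i] = true;
                        true
                    }
                    None => false,
                }
            })
        }
        // Scalars and mismatched kinds fall back to plain equality.
        _ => expected == recorded,
    }
}
```

Under this sketch, an expected `{"q": "rust"}` accepts a recorded `{"q": "rust", "limit": 10}` but not the reverse, and an expected array `[1, 1]` needs two distinct `1` elements in the recorded array.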

The five match modes

The mode decides how the expected (reference) calls line up against the recorded calls. The first two shipped originally; the rest were added to match the cross-tool trajectory vocabulary (agentevals, lastmile, DeepEval, Ragas).

strict (alias exact_sequence): the recorded calls must match the reference calls one-for-one. Lengths must be equal, every position must match, order matters, and a trace that made extra trailing calls fails. The tightest mode; use it when the exact call plan is the contract. strict and exact_sequence are the same mode under two names; the newer strict reads better and the original exact_sequence stays for callers written against the first release.
subsequence: every reference call must appear, in the reference order, somewhere among the recorded calls, but the model may interleave other calls between them. Order-preserving, tolerant of interleaved extras (a login here, a log_event there).
unordered: every reference call must appear, matched one-to-one against distinct recorded calls, in any order; extra recorded calls are allowed. Passes regardless of the order the calls came in.
superset: at least the reference calls must be present, in any order; extras allowed. Same pass/fail outcome as unordered; the name signals "the reference is a lower bound on what must happen."
subset: the recorded calls must be a subset of the reference calls, i.e. the trace made no call the reference did not allow. This is the over-calling / wasted-call detector: it fails the moment the trace makes a call beyond the reference set, and it tolerates the trace calling fewer.

How the five relate: strict is the intersection of order and count constraints. Drop "no extras" but keep order and you get subsequence. Drop order entirely but still require every reference call and you get superset (and unordered, which has the same outcome). subset inverts the containment question: where superset asks "are all reference calls present?", subset asks "are all recorded calls allowed?".
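The relationships between the modes can be sketched over tool names alone. This is an illustration, not the library's code: `Mode`, `names_match`, and `covers` are hypothetical names, argument shapes are ignored, empty-reference handling is not modeled, and `subset` is read as plain set membership (whether the real validator also counts repeats is not stated here).

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode {
    Strict,
    Subsequence,
    Unordered,
    Superset,
    Subset,
}

// True when every name in `needed` can claim a distinct entry in `pool`.
fn covers(pool: &[&str], needed: &[&str]) -> bool {
    let mut used = vec![false; pool.len()];
    needed.iter().all(|n| {
        match (0..pool.len()).find(|&i| !used[i] && pool[i] == *n) {
            Some(i) => {
                used[i] = true;
                true
            }
            None => false,
        }
    })
}

// Name-only version of the five modes (assumption: args are not checked here).
fn names_match(mode: Mode, reference: &[&str], recorded: &[&str]) -> bool {
    match mode {
        // Same length, same names, same order.
        Mode::Strict => reference == recorded,
        // Walk the recorded calls once; each reference name must be found
        // after the previous match, so order is preserved and extras are skipped.
        Mode::Subsequence => {
            let mut it = recorded.iter();
            reference.iter().all(|want| it.any(|got| got == want))
        }
        // Every reference call is present, order ignored, extras allowed.
        Mode::Unordered | Mode::Superset => covers(recorded, reference),
        // Every recorded call must be one the reference allows.
        Mode::Subset => recorded.iter().all(|r| reference.contains(r)),
    }
}
```

With a recorded trace of `login, search, log_event, open` and a reference of `search, open`, this sketch fails strict (extra calls), passes subsequence, unordered, and superset, and fails subset because `login` and `log_event` are not in the reference.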

An empty reference list is trivially satisfied by any trace under every mode except subset, where "no reference calls" means "no calls allowed," so a non-empty trace fails and an empty trace passes. That keeps "I do not care about the call plan here" expressible without a special case while letting subset express "this step should make no tool calls at all."

What the result carries

validate_trace returns a TraceValidation { passed, mismatches }. On a pass, mismatches is empty. On a failure, each TraceMismatch names the expected-call index, the recorded-call index it was looking at (or None when the trace ran out of calls), the per-call AssertionDiff list, and a one-line human-readable reason. A wrong name at a position reports a diff on the /name pointer; a wrong argument shape reports the diffs the underlying contains or schema matcher produced.

The function returns Err only on a structural problem with an expectation (a malformed JSON Schema). A normal shape mismatch is a successful validation that returns passed = false, the same contract the matchers follow: a failed assertion is not an error.
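A caller therefore has three outcomes to handle, not two. The sketch below shows one way a CI gate might branch on them; `Validation`, `Gate`, and `gate` are hypothetical simplifications (the real result type is TraceValidation with TraceMismatch entries, and the real error type is not shown on this page, so a String stands in for it).

```rust
// Simplified stand-in for the validation result (illustration only).
#[derive(Debug)]
struct Validation {
    passed: bool,
    reasons: Vec<String>,
}

// The three outcomes a caller can distinguish.
#[derive(Debug, PartialEq)]
enum Gate {
    // The trace matched the expectation.
    Pass,
    // The trace did not match; carries the human-readable reasons.
    Fail(Vec<String>),
    // The expectation itself was malformed, so nothing was checked.
    BadExpectation(String),
}

fn gate(outcome: Result<Validation, String>) -> Gate {
    match outcome {
        Ok(v) if v.passed => Gate::Pass,
        // A mismatch is still Ok: the validation ran and reported a failure.
        Ok(v) => Gate::Fail(v.reasons),
        // Err is reserved for a structurally broken expectation.
        Err(e) => Gate::BadExpectation(e),
    }
}
```

Keeping the malformed-expectation case separate lets a suite report "fix your test" differently from "the agent regressed".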

Pulling the calls out of an envelope

A trace envelope (from ConversationTrace::to_envelope) has tool_calls at the root; a cassette envelope nests it under trace.tool_calls. The helper tool_calls_from_envelope checks the cassette nesting first, then the root, and returns an empty list when neither is present, so you can hand it either shape:

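use mcptest_core::eval::{
    tool_calls_from_envelope, validate_trace, ExpectedArgs, ExpectedCall, ExpectedTrace,
};

// `cassette_value` is assumed to be an envelope you have already loaded
// (either the trace shape or the cassette shape).
let calls = tool_calls_from_envelope(&cassette_value);
// Expect `search` with args containing {"q": "rust"}, then `open` pinned by name,
// in that order; other calls may be interleaved.
let expected = ExpectedTrace::subsequence(vec![
    ExpectedCall {
        name: "search".into(),
        args: ExpectedArgs::Subset(serde_json::json!({"q": "rust"})),
    },
    ExpectedCall::by_name("open"),
]);
// `?` propagates only a malformed expectation; a mismatch is Ok with passed == false.
let result = validate_trace(&calls, &expected)?;
assert!(result.passed);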

Golden-path efficiency scoring

Matching answers "did the right calls happen?" The golden path answers a different question: "did they happen efficiently?" It compares the recorded tool sequence against an ideal ("golden") sequence and counts wasted work, then folds the counts into one penalty multiplier. This is the path-efficiency half of the trajectory vocabulary.

A GoldenPath carries the ideal tool names in order plus three policy flags:

use mcptest_core::eval::{score_golden_path, GoldenPath};

let golden = GoldenPath {
    calls: vec!["search".into(), "open".into(), "summarize".into()],
    allow_extra_steps: false,    // penalize calls beyond the path length
    penalize_backtracking: true, // penalize returning to an earlier tool
    penalize_repeated_tools: true, // penalize consecutive duplicate calls
};
let score = score_golden_path(&calls, &golden);
assert!(score.passed); // true only when penalty == 1.0

score_golden_path reads only each recorded call's name and returns a PathScore { passed, extra_steps, backtracks, repeated_tools, penalty }:

extra_steps: recorded calls beyond the golden path length, actual.len().saturating_sub(golden.calls.len()).
backtracks: a call to a tool used earlier that is not the immediately preceding tool. The trajectory moved on to a different tool and then came back. Each such return counts once.
repeated_tools: a call whose tool name equals the immediately preceding call's name (a consecutive duplicate). Consecutive repeats are counted here, not as backtracks.

All three counts are always reported, even for a dimension whose penalty is switched off, so a reporter can show the full picture. Only the enabled dimensions move the penalty.
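The three counts can be reproduced with a single pass over the recorded tool names. This is a sketch of the definitions above, not the body of score_golden_path: `Waste` and `count_waste` are hypothetical names, and the function takes plain name slices instead of recorded call objects.

```rust
use std::collections::HashSet;

// Raw waste counts, reported regardless of which penalties are enabled.
#[derive(Debug, PartialEq)]
struct Waste {
    extra_steps: usize,
    backtracks: usize,
    repeated_tools: usize,
}

fn count_waste(actual: &[&str], golden_len: usize) -> Waste {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut prev: Option<&str> = None;
    let (mut backtracks, mut repeated_tools) = (0, 0);

    for &name in actual {
        if prev == Some(name) {
            // Same tool as the immediately preceding call: a consecutive repeat.
            repeated_tools += 1;
        } else if seen.contains(name) {
            // Used earlier, but not the previous call: the trajectory came back.
            backtracks += 1;
        }
        seen.insert(name);
        prev = Some(name);
    }

    Waste {
        // Calls beyond the golden path length.
        extra_steps: actual.len().saturating_sub(golden_len),
        backtracks,
        repeated_tools,
    }
}
```

For the recorded sequence `search, search, open, search, summarize` against a three-call golden path, this yields two extra steps, one repeat (the second `search`), and one backtrack (the third `search`, which returns after `open`).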

The penalty formula

Let w be the sum of the enabled waste counts (extra_steps only when allow_extra_steps is false, backtracks only when penalize_backtracking, repeated_tools only when penalize_repeated_tools). Then

penalty = 1.0 / (1.0 + 0.5 * w)

The result is 1.0 exactly when w == 0 and decreases monotonically toward 0.0 as waste grows, so it always lands in (0.0, 1.0] with 1.0 meaning no penalty. passed is w == 0. Because the penalty is monotone in w, a backtracking trace always scores below an otherwise-identical clean trace, and turning a penalty flag off both removes that dimension from w and keeps its raw count visible for reporting.
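A short worked example of the formula, assuming the flag semantics described above. `Counts`, `Policy`, `enabled_waste`, and `penalty` are hypothetical names for this sketch; only the arithmetic comes from the formula itself.

```rust
// Raw counts, as reported in the score.
struct Counts {
    extra_steps: usize,
    backtracks: usize,
    repeated_tools: usize,
}

// The three policy flags, mirroring the GoldenPath fields.
struct Policy {
    allow_extra_steps: bool,
    penalize_backtracking: bool,
    penalize_repeated_tools: bool,
}

// w: the sum of only those counts whose penalty is switched on.
fn enabled_waste(c: &Counts, p: &Policy) -> usize {
    let mut w = 0;
    if !p.allow_extra_steps {
        w += c.extra_steps;
    }
    if p.penalize_backtracking {
        w += c.backtracks;
    }
    if p.penalize_repeated_tools {
        w += c.repeated_tools;
    }
    w
}

// penalty = 1.0 / (1.0 + 0.5 * w), which lies in (0.0, 1.0].
fn penalty(w: usize) -> f64 {
    1.0 / (1.0 + 0.5 * w as f64)
}
```

With two extra steps, one backtrack, and one repeat, the strictest policy gives w = 4 and a penalty of 1 / 3. Allowing extra steps drops w to 2 and raises the penalty to 0.5, while the raw extra_steps count stays at 2 for reporting.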

GoldenPath::new(names) is the strictest policy: every penalty enabled and no tolerance for extra steps.

Current limitation: library only

This release ships the validator as a pure mcptest-core function and its types. There is no YAML matcher surface yet: you cannot write an expected trace in a .yml suite and have mcptest run enforce it. Wiring an expected_trace: block through the suite parser and the executor is a follow-up ticket. For now, the validator is reachable from Rust callers and the SDK hosts that embed mcptest-core.