Session ledger

The session ledger is a structured, append-only record of the MCP tool calls a run made, keyed to a session. It exists so you can answer the question behavioral eval actually asks, "did the agent call the right tools, in the right order, with the right parameters, given the right inputs," by querying structure rather than parsing a transcript.

Two failure modes the ledger avoids:

Logging the transcript instead of the structure. A transcript has the tool calls in it, but extracting them means running regexes over model output, which breaks silently the moment the model formats its output differently. The ledger records each call as first-class fields.
Grading prose instead of behavior. "The agent wrote a correct email" is a behavior question only when "correct" means "the tool was called with these arguments, given these inputs." The structured record makes that answerable.

Format

A ledger file is newline-delimited JSON (NDJSON): exactly one header record, then one tool_call record per call, in call order. Before anything is written, payloads are redacted with the same redactor the reporters and cassettes use.

The records are validated by schemas/session-ledger-v1.json.

{"type":"header","schema_version":"v1","session_id":"019...","run_id":"019...","started_at":"2026-06-05T12:00:00Z","mcptest_version":"1.0.0","suite":"tests/agent.yml"}
{"type":"tool_call","session_id":"019...","agent_id":null,"hop_index":0,"tool_name":"search","server":"web","params":{"q":"weather sacramento"},"result":{"content":[{"type":"text","text":"..."}]},"is_error":false,"inputs_digest":"a1b2c3d4e5f60718","started_at":"2026-06-05T12:00:01Z","duration_ms":142,"caller":"direct"}
{"type":"tool_call","session_id":"019...","agent_id":null,"hop_index":1,"tool_name":"get_weather","server":"weather","params":{"city":"Sacramento"},"result":{"content":[{"type":"text","text":"72F"}]},"is_error":false,"inputs_digest":"f0e1d2c3b4a59687","started_at":"2026-06-05T12:00:02Z","duration_ms":88,"caller":"direct"}
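Reading a ledger back needs nothing beyond a JSON parser. The following is a minimal sketch, not mcptest code: read_ledger is a hypothetical helper, and it only checks the layout described above (one header first, then tool_call records). It does not perform the schema validation that schemas/session-ledger-v1.json provides.

```python
import json

def read_ledger(lines):
    """Parse NDJSON lines into (header, tool_calls).

    Hypothetical helper: checks only that the first record is a header
    and that every later record is a tool_call.
    """
    records = [json.loads(line) for line in lines if line.strip()]
    if not records or records[0].get("type") != "header":
        raise ValueError("ledger must start with a header record")
    header, calls = records[0], records[1:]
    for record in calls:
        if record.get("type") != "tool_call":
            raise ValueError(f"unexpected record type: {record.get('type')!r}")
    return header, calls

# Abbreviated records modeled on the example above.
sample = [
    '{"type":"header","schema_version":"v1","session_id":"s1"}',
    '{"type":"tool_call","session_id":"s1","hop_index":0,"tool_name":"search","params":{"q":"weather sacramento"}}',
    '{"type":"tool_call","session_id":"s1","hop_index":1,"tool_name":"get_weather","params":{"city":"Sacramento"}}',
]
header, calls = read_ledger(sample)
tool_order = [call["tool_name"] for call in calls]
```

For a file on disk, passing the open file object works the same way, since iterating a text file yields one line at a time.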

Fields

Every tool_call record carries the structure that eval queries:

session_id: groups every record from one run or user task.
agent_id: which agent made the call. Set in multi-agent (planner/worker) runs so a per-agent baseline does not conflate behaviors; null for single-agent runs.
hop_index: zero-based position of the call in the run. Lets a baseline assert a call happened at a specific step, after the agent saw the output of the prior step, not just "at some point."
tool_name, server: the bare tool name and the server it routed to.
params: the structured arguments, so you assert the args, not just the tool identity.
result, is_error: the tool result and whether it was flagged as an error.
inputs_digest: a stable fingerprint of the context the agent had in front of it when it made the call, so a baseline can assert the inputs were identical across runs (a determinism check) without storing the full context.
started_at, duration_ms, caller: timing and how the call was issued.
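To make the idea of a stable fingerprint concrete, here is one way such a digest could be computed. This is an assumption-heavy sketch: this document does not specify how mcptest derives inputs_digest, so the choice of SHA-256, the canonical JSON serialization, and the truncation to 16 hex characters (picked only to match the length of the example values) are all illustrative.

```python
import hashlib
import json

def inputs_digest(context):
    """Illustrative fingerprint of the context an agent saw before a call.

    Assumption: canonical JSON (sorted keys, no extra whitespace) hashed
    with SHA-256 and truncated to 16 hex characters. Not mcptest's
    documented algorithm.
    """
    canonical = json.dumps(context, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Key order does not affect the digest; content does.
first = inputs_digest({"prompt": "weather?", "history": ["search"]})
second = inputs_digest({"history": ["search"], "prompt": "weather?"})
third = inputs_digest({"prompt": "weather?", "history": []})
```

The property that matters for a determinism check is that equal contexts always produce equal digests, which is why the serialization is canonicalized before hashing.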

The same fields are exposed on the agent run envelope for inline assertions: tool_calls[i].hop_index, tool_calls[i].agent_id, tool_calls[i].inputs_digest (see yaml-reference.md's agent target grammar), alongside tool_names and redundant_tool_calls.
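With these fields available, "right tools, right order, right parameters" reduces to plain comparisons. The sketch below is illustrative rather than part of mcptest: check_trajectory is a hypothetical helper, and it accepts any list of dicts that carry hop_index, tool_name, and params.

```python
def check_trajectory(calls, expected):
    """Return mismatch messages; an empty list means the behavior matched.

    expected is a list of (tool_name, params) pairs in call order.
    Hypothetical helper, not an mcptest API.
    """
    ordered = sorted(calls, key=lambda call: call["hop_index"])
    problems = []
    if len(ordered) != len(expected):
        problems.append(f"expected {len(expected)} calls, got {len(ordered)}")
    for call, (tool_name, params) in zip(ordered, expected):
        hop = call["hop_index"]
        if call["tool_name"] != tool_name:
            problems.append(
                f"hop {hop}: expected tool {tool_name}, got {call['tool_name']}"
            )
        elif call["params"] != params:
            problems.append(f"hop {hop}: params differ for {tool_name}")
    return problems

calls = [
    {"hop_index": 0, "tool_name": "search", "params": {"q": "weather sacramento"}},
    {"hop_index": 1, "tool_name": "get_weather", "params": {"city": "Sacramento"}},
]
expected = [
    ("search", {"q": "weather sacramento"}),
    ("get_weather", {"city": "Sacramento"}),
]
problems = check_trajectory(calls, expected)
```

Returning a list of messages instead of raising on the first mismatch lets a test report every divergence in one pass.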

CLI: emit and diff

mcptest ledger emit turns a saved agent run envelope (the tool_calls shape mcptest produces, for example a single agent test extracted from mcptest run --reporter json) into a ledger:

mcptest ledger emit envelope.json --session-id run-42 --output baseline.ndjson
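Conceptually, emitting a ledger is a reshaping step: one header line, then one line per entry in the envelope's tool_calls. The sketch below illustrates that shape only. emit_ledger is a hypothetical function, the envelope is assumed to be a dict with a tool_calls list, only a subset of the header fields is written, and the redaction step the real command applies is omitted.

```python
import json
from datetime import datetime, timezone

def emit_ledger(envelope, session_id):
    """Return NDJSON text: one header line, then one tool_call line per call.

    Hypothetical sketch. Assumes envelope["tool_calls"] is a list of dicts
    and skips the redaction that mcptest performs before writing.
    """
    header = {
        "type": "header",
        "schema_version": "v1",
        "session_id": session_id,
        "started_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    lines = [json.dumps(header)]
    for position, call in enumerate(envelope["tool_calls"]):
        record = {"type": "tool_call", "session_id": session_id, **call}
        # Fill in structural fields when the envelope leaves them out.
        record.setdefault("hop_index", position)
        record.setdefault("agent_id", None)
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"

envelope = {"tool_calls": [{"tool_name": "search", "params": {"q": "weather sacramento"}}]}
ledger_text = emit_ledger(envelope, "run-42")
```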

mcptest ledger diff gates an actual run against a baseline. It compares the tool-call shape position by position, per agent_id: a different tool at a hop counts as a remove plus an add (two divergences), and a matching tool with different params counts as a param change. The command exits non-zero when the number of divergences exceeds --max-diff (default 0, exact match required), so it drops straight into CI:

mcptest ledger diff baseline.ndjson actual.ndjson --max-diff 0

  - removed  hop 1: fetch
  + added    hop 1: delete
ledger diff: 2 divergence(s) exceed --max-diff 0

This grades behavior at the tool boundary, not prose: the baseline is a recorded trajectory, CI captures a fresh ledger, and the diff fails the build when the agent stops doing the right thing.
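The positional comparison can be sketched in a few lines. This is an approximation of the rules stated above, not the mcptest implementation: diff_ledgers and exceeds_max_diff are hypothetical names, divergences are returned as tuples rather than formatted text, and how the real command treats calls that exist on only one side is an assumption here.

```python
from collections import defaultdict

def diff_ledgers(baseline, actual):
    """Compare tool calls position by position within each agent_id.

    Returns (kind, hop_index, tool_name) tuples, where kind is
    "removed", "added", or "param_change". Hypothetical sketch.
    """
    def by_agent(calls):
        groups = defaultdict(list)
        for call in sorted(calls, key=lambda c: c["hop_index"]):
            groups[call.get("agent_id")].append(call)
        return groups

    base, act = by_agent(baseline), by_agent(actual)
    divergences = []
    # key=str lets None (single-agent runs) sort alongside string ids.
    for agent in sorted(set(base) | set(act), key=str):
        b_calls, a_calls = base.get(agent, []), act.get(agent, [])
        for i in range(max(len(b_calls), len(a_calls))):
            b = b_calls[i] if i < len(b_calls) else None
            a = a_calls[i] if i < len(a_calls) else None
            if b is not None and a is not None and b["tool_name"] == a["tool_name"]:
                if b["params"] != a["params"]:
                    divergences.append(("param_change", b["hop_index"], b["tool_name"]))
                continue
            # Different tool, or a call present on one side only.
            if b is not None:
                divergences.append(("removed", b["hop_index"], b["tool_name"]))
            if a is not None:
                divergences.append(("added", a["hop_index"], a["tool_name"]))
    return divergences

def exceeds_max_diff(divergences, max_diff=0):
    """Mirror the gate: fail when divergences exceed the allowed count."""
    return len(divergences) > max_diff

baseline = [
    {"agent_id": None, "hop_index": 0, "tool_name": "search", "params": {"q": "x"}},
    {"agent_id": None, "hop_index": 1, "tool_name": "fetch", "params": {"id": 1}},
]
actual = [
    {"agent_id": None, "hop_index": 0, "tool_name": "search", "params": {"q": "x"}},
    {"agent_id": None, "hop_index": 1, "tool_name": "delete", "params": {"id": 1}},
]
divergences = diff_ledgers(baseline, actual)
```

Grouping by agent_id before comparing keeps a reordering between a planner and a worker from showing up as a divergence in either agent's own sequence.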

Scope

The ledger captures one thing and leaves the rest to downstream tooling:

mcptest emits the ledger artifact for a single local run, and owns this schema as the stable contract. Its job is "let me see and diff what crossed the wire in my own session," which makes it a sibling of cassettes.
Downstream consumers handle anything cross-session: aggregation, a durable queryable store, a tamper-evident signed audit chain, identity binding, and live policy enforcement. Those are out of scope for this artifact.

Because the schema is defined here, every emitter (the mcptest CLI in test or replay, a runtime proxy in production) and every consumer speaks one contract.