Multi-run agent tests and the pass^k tool-selection gate

A single agent run is a coin flip. A model that picks the right tool once might pick it because the prompt was easy, because the temperature landed well, or because it got lucky. Run the same prompt ten times and the picture changes: now you can see how often the model actually reaches for the tool you expect, and how many tokens it burns doing so. That rate, not the single outcome, is what tells you whether the integration is reliable.

Run this example. examples/agent-multi-run-tool-selection.yml runs one prompt ten times and gates on a tool_selection: floor: a minimum selection rate plus a per-run token cap. Record once, then replay it in CI.

ANTHROPIC_API_KEY=... mcptest run --record --config examples/agent-multi-run-tool-selection.yml

The runs: count and the tool_selection: floor parse in mcptest-core::suite; the cross-run aggregation is the pure function mcptest_core::eval::selection::aggregate_selection; the gate runs at the CLI in mcptest run.

Running each agent test more than once

Add runs: N to an agent test to dispatch the same (agent, model) pair N times:

agents:
  - name: weather selection
    model: claude-sonnet-4-5
    servers: [weather]
    prompt: What is the weather in Sacramento?
    runs: 10

runs: defaults to 1, which is the legacy single-run behavior: nothing changes for suites that omit it. Values above 1 produce one reporter row per run, named weather selection #1 through weather selection #10, so each run is still a normal pass/fail line you can read and re-run. When the agent test also fans out across a model matrix, the run index is appended to the per-model row, for example weather selection [gpt-5] #3.

A runs: 0 is clamped up to 1. The runner never produces zero rows from a typo.

The tools-vs-tokens tradeoff

Picking the right tool is only half the story. A model can be made to pick the right tool more often by giving it a larger budget, more turns, or a longer system prompt, and at some point the cost of getting it right stops being worth it. The tool_selection: floor lets you assert both halves at once: a minimum selection rate AND an optional ceiling on the tokens each run is allowed to spend.

agents:
  - name: weather selection
    model: claude-sonnet-4-5
    servers: [weather]
    prompt: What is the weather in Sacramento?
    runs: 10
    tool_selection:
      expected_tool: get_weather
      min_selection_rate: 0.8
      max_total_tokens: 2000

Field	Required	Meaning
`expected_tool`	yes	The tool the model is supposed to call. A run "selects" it when the name appears in any `tool_calls[].name`.
`min_selection_rate`	yes	The minimum fraction of runs (0.0 to 1.0) that must select the tool.
`max_total_tokens`	no	A per-run cap on `conversation.tokens.total`. Omit it for no token budget.

The floor is independent of the per-run expect: block. You can use both: expect: asserts the shape of any one run, tool_selection: asserts the behavior across all of them.

What pass^k measures

The aggregation reports three numbers per (agent, model) pair:

selection rate: selected / runs. How often the model reached for the expected tool, ignoring cost.
pass^k: the fraction of runs that BOTH selected the tool AND stayed within the token budget. This is the headline metric. A run that picks the right tool but blows the budget does not count toward pass^k, because correctness bought with an unbounded budget is not correctness you can rely on.
max tokens: the largest conversation.tokens.total seen across the runs, so a single expensive outlier is visible.

The CLI prints these as a single cost-vs-accuracy line per pair:

tool-selection floor [PASS] weather selection: selection 9/10 (90%), pass^k 90%, max tokens 1840

How the gate fails the run

The floor passes when the selection rate meets min_selection_rate AND every run stayed within max_total_tokens (when set). The budget check is all-or-nothing: one run over the cap fails the floor even at a 100% selection rate, because that run proves the model can exceed the budget.

When a floor is not met, mcptest run exits with code 1, the same code a failing test returns. A floor breach is a failing outcome, not a configuration or budget error, so it reuses the test-failure code rather than minting a new one. This mirrors how --coverage-threshold gates: the CLI computes the metric, prints it, and returns a non-zero exit on violation. The gate only fires when a tool_selection: block is present; agent tests without one are unaffected.

Current limitation: cassette replay is single-run

Multi-run is a live-dispatch feature today. When a recorded cassette is replayed (mcptest run with a cassette on disk and no --record), the agent test runs once regardless of runs:. Wiring multi-run through the cassette layer (recording and replaying N traces per pair) is a follow-up. For now, use the floor against live models.