
Scenario 13: tool overload and selection under noise

A model that picks the right tool from a list of three may not pick it from a list of twenty. Real MCP deployments rarely present a clean catalog: the candidate list is padded with similar-sounding tools, and a capable agent has to find the one tool that actually does the job amid the look-alikes. This scenario measures that directly, using the hosted test server.

The hosted endpoint https://test.mcptest.sh/mcp?scenario=distractors serves one real tool, get_forecast(city), buried among n near-duplicate decoys. Only get_forecast returns a real forecast; every decoy errors with JSON-RPC -32601 (method not found) when called. The real tool is placed last in the catalog, so an agent that always grabs the first plausible match cannot win by position. Raise n and you chart how selection accuracy decays as the catalog grows.

This is an agent test: an agents: block drives a real model through the tool-use loop, and you assert on the trace to confirm that the model called get_forecast and not a decoy. Because a real model is in the loop, this step needs a provider API key (for example ANTHROPIC_API_KEY). There is no offline stub for measuring selection under noise; the whole point is what the model actually does.

The YAML

Save this as tests/tool-overload.yml:

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json

servers:
  weather:
    url: "https://test.mcptest.sh/mcp?scenario=distractors&n=8"

agents:
  - name: forecast survives 8 decoys
    model: claude-sonnet-4-5
    servers: [weather]
    runs: 4
    prompt: What is the weather forecast for Sacramento?
    expect:
      - target: tool_calls[0].name
        matcher: { exact: get_forecast }
      - target: tool_names
        matcher:
          contains-all: [get_forecast]
      - target: tool_calls[0].args.city
        matcher: { regex: "(?i)sacramento" }

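What is happening here:

The servers: block points at the hosted distractors endpoint with n=8 decoys, and the agent entry runs the prompt 4 times (runs: 4) against claude-sonnet-4-5. The first assertion is the strict one: tool_calls[0].name must be exactly get_forecast, so the model's first pick has to be the real tool. The second is lenient: tool_names only has to contain get_forecast, so it passes as long as the real tool is called at some point in the run. The third checks that the city argument on the first call matches sacramento, case-insensitively.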

If you also want the objective selection metric, add a tool_selection floor via the equal-function-set gate documented in docs/tool-selection-f1.md. Group the real tool into a one-member class and gate tool_selection.f1 so the report carries a number you can track over time, with no extra model calls.

Run it (note the provider key)

This is an agent run, so it dispatches a real model and needs a key for that model's provider:

ANTHROPIC_API_KEY=sk-ant-... mcptest run --config tests/tool-overload.yml

The model id claude-sonnet-4-5 auto-detects the Anthropic family and reads ANTHROPIC_API_KEY. Swap the model for a gpt-, gemini-, or mistral- id and set the matching key (OPENAI_API_KEY, GEMINI_API_KEY, MISTRAL_API_KEY) to measure a different model. If the key is missing, the run does not error: it falls back to a deterministic stub, which does not exercise real tool selection, so the result is meaningless for this scenario. Set the key.
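Because a missing key falls back to the stub silently, a small guard in front of the run can fail fast instead. The following is a minimal sketch, not part of mcptest: key_var_for_model and require_key are hypothetical helper names, and the claude- prefix match is an assumption inferred from the claude-sonnet-4-5 id above.

```shell
# Hypothetical helpers, not part of mcptest.
# Map a model id prefix to the provider key variable named in this section.
# The claude-* pattern is an assumption based on the claude-sonnet-4-5 id.
key_var_for_model() {
  case "$1" in
    claude-*)  echo ANTHROPIC_API_KEY ;;
    gpt-*)     echo OPENAI_API_KEY ;;
    gemini-*)  echo GEMINI_API_KEY ;;
    mistral-*) echo MISTRAL_API_KEY ;;
    *)         return 1 ;;
  esac
}

# Return 1 (with a message on stderr) when the key for the model is unset
# or empty, and 2 when the model family is not recognised.
require_key() {
  var=$(key_var_for_model "$1") || { echo "unknown model family: $1" >&2; return 2; }
  eval "val=\${$var:-}"
  if [ -z "$val" ]; then
    echo "$var is not set; the run would fall back to the stub" >&2
    return 1
  fi
}
```

With these defined, require_key claude-sonnet-4-5 && mcptest run --config tests/tool-overload.yml only starts the run when the key is present.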

To compare several models in one pass, list them under models: or run a --models sweep, and the report renders a per-model grid. See docs/models.md for the matrix form.

Sweep the catalog size (raise n)

The interesting result is not whether the model wins at n=8; it is how fast accuracy falls as the catalog grows. Run the same suite at several values of n and read the trend. The cleanest way is to keep n in a variable and override it per run:

servers:
  weather:
    url: "https://test.mcptest.sh/mcp?scenario=distractors&n=${n}"

variables:
  n:
    default: "8"

# Walk the catalog from a clean baseline up to the clamp ceiling.
for n in 1 4 8 16 20; do
  echo "=== n=$n ==="
  ANTHROPIC_API_KEY=sk-ant-... mcptest run \
    --var n=$n \
    --config tests/tool-overload.yml
done

n=1 is the easy baseline: one real tool, one decoy. Each step up crowds the catalog with more look-alikes. A robust model holds its first-pick accuracy as n climbs to the 20 ceiling; a brittle one starts probing decoys and the tool_calls[0].name assertion begins to fail. Plot the pass rate against n and you have the accuracy-decay curve the tool-overload benchmarks report (see docs/distractor-tools.md for the background and the offline, model-free scoring variant).
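To turn the sweep into numbers you can plot, one option is to reduce each run's summary line to a pass percentage. This is a rough sketch under two assumptions: the summary has the "1 passed, 0 failed in 3.1s" shape shown under Expected output, and it is the last line the run prints. pass_rate is a hypothetical helper, not an mcptest command. With a single test in the file the rate per run is either 0 or 100, so a smoother curve needs more tests or repeated invocations.

```shell
# Hypothetical helper: turn a summary line such as
# "1 passed, 0 failed in 3.1s" into an integer pass percentage.
# Assumes the summary format shown on this page; prints NA when no tests ran.
pass_rate() {
  echo "$1" | awk '{
    passed = $1; failed = $3
    total = passed + failed
    if (total == 0) { print "NA" } else { printf "%d\n", (100 * passed) / total }
  }'
}

# Possible use inside the sweep loop (assumes the summary is the last line):
#   summary=$(mcptest run --var "n=$n" --config tests/tool-overload.yml | tail -n 1)
#   echo "$n,$(pass_rate "$summary")"
```

Each iteration then emits one n,rate pair, which is enough to feed any plotting tool.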

Expected output

A run with the key set, at n=8, where the model picks the real tool on every repeat:

mcptest run --config tests/tool-overload.yml

  PASS  forecast survives 8 decoys [claude-sonnet-4-5]    (4 runs, 2.9s)
          tool_calls[0].name      get_forecast
          tool_names              [get_forecast]
          tool_calls[0].args.city Sacramento

1 passed, 0 failed in 3.1s

When the model takes the bait, the strict first-pick assertion fails and names the decoy it chose:

  FAIL  forecast survives 8 decoys [claude-sonnet-4-5]    (4 runs, 3.4s)
          tool_calls[0].name: expected get_forecast, got get_forecast_v2

The decoy get_forecast_v2 is one of the synthesized look-alikes. The model called it first, and the server answered with -32601. The lenient tool_names form would still pass here if the model recovered and called the real get_forecast on a later turn, which is why the two assertions answer different questions.
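The difference between the strict and lenient checks can be illustrated outside mcptest. The helpers below are hypothetical and only mimic the two assertion styles over a list of tool names in call order; they are not how mcptest evaluates matchers.

```shell
# Illustrative only: these helpers mimic the two assertion styles.
# First argument is the wanted tool; the rest are tool names in call order.

# Strict: the first call must be the wanted tool.
first_pick_is() {
  want=$1; shift
  [ "${1:-}" = "$want" ]
}

# Lenient: the wanted tool appears anywhere in the trace.
called_at_all() {
  want=$1; shift
  for name in "$@"; do
    [ "$name" = "$want" ] && return 0
  done
  return 1
}
```

For a trace of get_forecast_v2 followed by get_forecast, first_pick_is get_forecast fails while called_at_all get_forecast succeeds, which mirrors the recovered-on-a-later-turn case described above.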

Troubleshooting

See also