
Distractor tools and tool-overload scoring

A real MCP deployment rarely presents the agent with exactly the tools a task needs. The candidate list is padded with irrelevant or near-duplicate tools, and a capable agent must still pick the correct one amid the noise. This page explains how to inject N distractor tools into a scenario and assert that the agent still selects correctly. The scoring is objective and runs no model: a choice is correct when its id is in the declared correct set, and a choice is a distractor when its id is one of the injected distractors.

This is the tool-overload setup from MCPAgentBench (arXiv:2512.24565) and MCP-Atlas (arXiv:2602.00933), which inject distractor tools and report selection accuracy as a function of distractor count.

Two distractor sources

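The distractors: block declares how many distractors to inject (count:) and where they come from (source:). There are two sources, selected by source.from: catalog, which draws distractors from the bundled catalog, and near_duplicate, which synthesizes look-alikes from the real tool names listed under of:.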

Accuracy as a function of distractor count

The point of the setup is the degradation curve: selection accuracy tends to fall as the distractor count rises. Setting count: 0 gives the noise-free baseline. Raising count injects more look-alikes and irrelevant tools, and a robust agent holds its accuracy as the count climbs. Running the same scenario at several counts traces the accuracy-vs-distractor-count curve the benchmarks report.
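As a rough illustration of what tracing the curve means, the sketch below scores one recorded selection per distractor count and returns the (count, accuracy) points in ascending order. It assumes the accuracy rule described later on this page; the function names, the d1..d8 distractor ids, and the recorded selections are hypothetical placeholders made up for the example, not mcptest output or API.

```python
from typing import Dict, List, Set, Tuple


def accuracy_percent(chosen: List[str], correct: Set[str], distractors: Set[str]) -> int:
    """Integer percent of in-scope choices that were correct (floor division assumed)."""
    hits = sum(1 for tool in chosen if tool in correct)
    misses = sum(1 for tool in chosen if tool not in correct and tool in distractors)
    in_scope = hits + misses
    if in_scope == 0:
        # Nothing in scope: 0, except the vacuous case of no declared correct ids.
        return 100 if not correct else 0
    return hits * 100 // in_scope


def degradation_curve(
    recorded: Dict[int, Tuple[Set[str], List[str]]], correct: Set[str]
) -> List[Tuple[int, int]]:
    """Map each distractor count to its accuracy, sorted by count.

    `recorded` maps count -> (injected distractor ids, chosen tool ids).
    """
    return [
        (count, accuracy_percent(chosen, correct, injected))
        for count, (injected, chosen) in sorted(recorded.items())
    ]


# Hypothetical recorded selections, invented for illustration only.
correct = {"catalog.search_products"}
recorded = {
    0: (set(), ["catalog.search_products"]),
    4: ({f"d{i}" for i in range(1, 5)}, ["catalog.search_products"]),
    8: ({f"d{i}" for i in range(1, 9)}, ["catalog.search_products", "d3"]),
}
curve = degradation_curve(recorded, correct)
```

With these made-up inputs the agent holds 100 at counts 0 and 4 and drops to 50 at count 8, where it also picked one distractor.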

Serial vs parallel complexity

The optional complexity: tag stratifies the reporting by invocation complexity, the MCPAgentBench axis. It is descriptive metadata and does not change the accuracy math; it lets a report group degradation curves by complexity. The examples on this page use the values serial and parallel.

The accuracy rule

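Given the agent's chosen tool ids, the set of correct ids, and the set of injected distractor ids, the scorer tallies two counts over the chosen ids: chose_correct, the chosen ids that are in the correct set, and chose_distractor, the chosen ids that are among the injected distractors.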

Matching is exact membership of the chosen id; there is no judge model. Accuracy is the integer percent chose_correct * 100 / chose_in_scope, where chose_in_scope = chose_correct + chose_distractor. A run that chose nothing in scope scores 0 (it never selected the correct tool), except the vacuous case of zero correct ids declared, which scores 100. The block exposes two assertable targets: distractors.accuracy, the accuracy percent, and distractors.chose_distractor, the number of distractors chosen.

An empty or omitted expect: applies the default gate distractors.accuracy >= 50.
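The rule above is small enough to sketch directly. The following is a minimal Python illustration, not the mcptest implementation: the names score_selection and SelectionScore are hypothetical, the integer percent is assumed to use floor division (the page does not state the rounding mode), the correct and distractor sets are assumed disjoint, and the vacuous 100 is assumed to apply only when nothing in scope was chosen.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class SelectionScore:
    chose_correct: int
    chose_distractor: int
    accuracy: int  # integer percent, 0..100


def score_selection(
    chosen: Iterable[str], correct: Set[str], distractors: Set[str]
) -> SelectionScore:
    """Score chosen tool ids by exact set membership; no model is involved."""
    chosen_ids = list(chosen)
    chose_correct = sum(1 for tool in chosen_ids if tool in correct)
    # Assumption: an id is never both correct and a distractor; correct wins if it is.
    chose_distractor = sum(
        1 for tool in chosen_ids if tool not in correct and tool in distractors
    )
    chose_in_scope = chose_correct + chose_distractor
    if chose_in_scope == 0:
        # Nothing in scope scores 0, except when no correct ids were declared.
        accuracy = 100 if not correct else 0
    else:
        # Assumption: "integer percent" means floor division.
        accuracy = chose_correct * 100 // chose_in_scope
    return SelectionScore(chose_correct, chose_distractor, accuracy)


# "catalog.find_products" is a made-up distractor id for illustration.
score = score_selection(
    ["catalog.search_products", "catalog.find_products", "catalog.find_products"],
    correct={"catalog.search_products"},
    distractors={"catalog.find_products"},
)
```

Here one correct choice and two distractor choices give 1 * 100 / 3, which floors to 33. Ids that are in neither set are out of scope and affect neither count.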

Inline example suite

This is from examples/distractor-tools/. One mock server, catalog, exposes real tools; the distractors: block pads the candidate list and asserts the agent still picks the right one. The server is served by mcptest mock from a static manifest, so the run is deterministic and offline (see mcptest mock for the manifest format).

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json
servers:
  catalog:
    command: ["mcptest", "mock", "--tools-from", "./servers/catalog.yml"]

agents:
  - name: resists bundled distractors
    model: claude-sonnet-4-5
    servers: [catalog]
    runs: 4
    prompt: Find products that match the keyword "notebook".
    distractors:
      count: 4
      source:
        from: catalog
      correct: [catalog.search_products]
      complexity: serial
      expect:
        - target: distractors.accuracy
          matcher:
            schema: { minimum: 80 }
        - target: distractors.chose_distractor
          matcher:
            schema: { maximum: 0 }

A near-duplicate scenario derives look-alikes from the real tool names:

    distractors:
      count: 3
      source:
        from: near_duplicate
        of: [search_products, get_product]
      correct: [catalog.search_products, catalog.get_product]
      complexity: parallel

The referenced manifest servers/catalog.yml is a mock_server block:

mock_server:
  name: catalog
  tools:
    - name: search_products
      description: Search the product catalog by keyword.
      input_schema:
        type: object
        required: [query]
        properties:
          query: { type: string }
      response:
        content:
          - type: text
            text: "Products matching ${args.query}: sku-1, sku-2."
    - name: get_product
      description: Get one product by sku.
      input_schema:
        type: object
        required: [sku]
        properties:
          sku: { type: string }
      response:
        content:
          - type: text
            text: "Product ${args.sku}: in stock."

Each expect: item is the standard assertion shape: a target: paired with a matcher:. A floor on the accuracy percent is matcher: { schema: { minimum: N } }; a ceiling on the distractor count is matcher: { schema: { maximum: N } }.
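To make the floor and ceiling semantics concrete, here is a small sketch of how such expect: items could be checked against a scored result. It assumes the JSON Schema meaning of minimum and maximum (both inclusive) and the default gate described earlier for an empty or omitted expect:. The check_expect function and the results dictionary shape are hypothetical, not part of mcptest.

```python
from typing import Any, Dict, List, Optional

# Default gate applied when expect: is empty or omitted (accuracy >= 50).
DEFAULT_EXPECT: List[Dict[str, Any]] = [
    {"target": "distractors.accuracy", "matcher": {"schema": {"minimum": 50}}}
]


def check_expect(
    results: Dict[str, int], expect: Optional[List[Dict[str, Any]]] = None
) -> List[str]:
    """Return one failure message per violated bound; an empty list means pass."""
    failures: List[str] = []
    for item in expect or DEFAULT_EXPECT:
        target = item["target"]
        schema = item["matcher"]["schema"]
        value = results[target]
        # JSON Schema bounds are inclusive: value == minimum/maximum passes.
        if "minimum" in schema and value < schema["minimum"]:
            failures.append(f"{target}: {value} is below minimum {schema['minimum']}")
        if "maximum" in schema and value > schema["maximum"]:
            failures.append(f"{target}: {value} is above maximum {schema['maximum']}")
    return failures


# The two assertions from the example suite, as plain data.
expect = [
    {"target": "distractors.accuracy", "matcher": {"schema": {"minimum": 80}}},
    {"target": "distractors.chose_distractor", "matcher": {"schema": {"maximum": 0}}},
]
passing = check_expect(
    {"distractors.accuracy": 80, "distractors.chose_distractor": 0}, expect
)
failing = check_expect(
    {"distractors.accuracy": 66, "distractors.chose_distractor": 1}, expect
)
```

An accuracy of exactly 80 passes the floor because the bound is inclusive, while 66 with one chosen distractor violates both assertions.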

Objective, no model

The selection scoring is exact membership of the chosen ids against the correct and distractor sets, so it is byte-stable and free of model calls. The scoring math, the bundled catalog, and the near-duplicate synthesizer are exercised offline by the test for the core distractors eval module.

The distractors: block is marked x-mcptest-status: preview in the schema. The engine scores a recorded selection today; runtime injection of the distractors into the presented tool list is still preview-stage runner wiring.