Tool-description quality benchmark

mcptest already scores tool-description quality in two ways: the deterministic TDQS heuristics behind the tool_quality: check block, and the DESC-NNN description lints. Both answer "is this description well written?". Recent tool-description research asks a sharper question: does the description actually change which tool an agent picks? The mcptest tool-bench benchmark answers that question.

It is deterministic and model-free, so it runs in CI offline.

mcptest tool-bench run spec.yml --min-drop 0.3

What it measures

Given a tool catalog and a set of intents (a request plus the tool names that count as a correct pick), a deterministic lexical selector chooses the tool whose name and description best overlap each request. The benchmark reports:

Then it applies each declared mutation, one at a time, and reports how far selection dropped, which smells the edit introduced, and how many tokens it removed. That is the mutation-backed selection impact: proof that a description change improves or degrades selection, not just its style score.

The selection oracle is a deliberate proxy, a reproducible lower bound, not a claim about a specific model. It reacts to the same lexical signal a retriever or a model's tool router keys on, so degrading the keywords that made a description match an intent measurably moves the pick.
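As a rough illustration of the idea, the following Python sketch scores each tool by token overlap with the request and measures how far accuracy falls when one description is blanked. It is not mcptest's implementation: the tokenizer, the tie-breaking rule, and the second tool (lookup_b) are assumptions made for the example.

```python
import re

def tokens(text):
    """Lowercase alphanumeric tokens; underscores split names like lookup_a."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def select(query, tools):
    """Pick the tool whose name and description share the most tokens with the query."""
    q = tokens(query)

    def score(tool):
        return len(q & tokens(tool["name"] + " " + tool.get("description", "")))

    # max() keeps the first tool on ties, so catalog order breaks ties deterministically.
    return max(tools, key=score)["name"]

def accuracy(intents, tools):
    """Fraction of intents whose selected tool is in the gold set."""
    hits = sum(select(i["query"], tools) in i["gold"] for i in intents)
    return hits / len(intents)

catalog = [
    {"name": "lookup_a",
     "description": "Return the current weather forecast for a city, as JSON."},
    # lookup_b is hypothetical, added so the selector has something to confuse.
    {"name": "lookup_b",
     "description": "Return the latest stock price for a ticker symbol."},
]
intents = [
    {"query": "what is the weather forecast in Paris", "gold": ["lookup_a"]},
    {"query": "stock price for ACME", "gold": ["lookup_b"]},
]

baseline = accuracy(intents, catalog)

# Apply a "blank" mutation to lookup_a and re-run selection.
blanked = [dict(t, description="") if t["name"] == "lookup_a" else t for t in catalog]
mutated = accuracy(intents, blanked)
drop = baseline - mutated

print(f"baseline={baseline:.2f} mutated={mutated:.2f} drop={drop:.2f}")
```

With the opaque name lookup_a, the description is the only thing tying the tool to the weather request, so blanking it flips that pick and halves accuracy in this toy catalog.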

The spec

catalog:                      # inline tools/list, or use catalog_path:
  tools:
    - name: lookup_a
      description: Return the current weather forecast for a city, as JSON.
intents:
  - query: what is the weather forecast in Paris
    gold: [lookup_a]
mutations:
  - tool: lookup_a
    kind: blank               # blank | genericize | truncate | drop_keyword

catalog is an inline tools/list result (an object with a tools array, or a bare array of tools). Use catalog_path: to point at a saved snapshot instead. Each mutation names a tool and a kind:

kind          Parameters  Effect
blank         (none)      Remove the description entirely.
genericize    (none)      Replace it with a contentless phrase.
truncate      chars       Keep only the first chars characters.
drop_keyword  word        Remove every occurrence of a content word.
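The four mutation kinds could be approximated as plain string transforms. The sketch below is an assumption-laden illustration, not mcptest's code: the function name mutate, the generic placeholder phrase, and the whole-word, case-insensitive matching used for drop_keyword are all choices made for the example.

```python
import re
from typing import Optional

# Assumed placeholder; the phrase mcptest actually substitutes is not specified here.
GENERIC_PHRASE = "Does something."

def mutate(description: str, kind: str, *,
           chars: Optional[int] = None, word: Optional[str] = None) -> str:
    """Return a mutated copy of a tool description (hypothetical helper)."""
    if kind == "blank":
        return ""
    if kind == "genericize":
        return GENERIC_PHRASE
    if kind == "truncate":
        if chars is None:
            raise ValueError("truncate requires chars")
        return description[:chars]
    if kind == "drop_keyword":
        if word is None:
            raise ValueError("drop_keyword requires word")
        # Remove whole-word matches, ignoring case, then collapse leftover whitespace.
        stripped = re.sub(rf"\b{re.escape(word)}\b", "", description, flags=re.IGNORECASE)
        return re.sub(r"\s+", " ", stripped).strip()
    raise ValueError(f"unknown mutation kind: {kind}")

desc = "Return the current weather forecast for a city, as JSON."
print(mutate(desc, "truncate", chars=6))
print(mutate(desc, "drop_keyword", word="weather"))
```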

CLI

mcptest tool-bench run <spec> [--min-drop FRACTION] [--json]

--min-drop turns the run into a regression gate: it exits non-zero when the largest selection drop across mutations is below the threshold, i.e. the descriptions are not actually carrying the selection signal. --json emits the structured report.
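The gate logic described above amounts to a single comparison. A minimal Python sketch, assuming drops are fractions between 0 and 1 and that the failing exit code is 1 (the source only says "non-zero"); gate_exit_code is a hypothetical name:

```python
def gate_exit_code(drops, min_drop):
    """Return 0 when the largest selection drop reaches min_drop, else 1.

    A small largest drop means no mutation moved the pick much, so the
    descriptions are not carrying the selection signal and the gate fails.
    """
    largest = max(drops, default=0.0)  # no mutations counts as no drop
    return 0 if largest >= min_drop else 1

print(gate_exit_code([0.5, 0.1], min_drop=0.3))  # largest drop 0.5 clears the threshold
print(gate_exit_code([0.1, 0.2], min_drop=0.3))  # largest drop 0.2 is below it
```

Note the direction of the check: unlike most regression gates, a larger drop is the passing outcome, because it shows that damaging a description changes what gets selected.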

Smells in tool_quality:

The smell categories are also exposed by the tool_quality: check block, so a mcptest run suite can gate on them directly:

tool_quality:
  - name: catalog smells
    server: my-server
    expect:
      - target: smells.underspecified
        matcher: { schema: { maximum: 0 } }
      - target: smell_total
        matcher: { schema: { maximum: 2 } }

See the runnable tool-description-bench example.