Tool-description quality benchmark
mcptest scores tool-description quality two ways already: the deterministic TDQS heuristics behind the tool_quality: check block, and the DESC-NNN description lints. Both answer "is this description well written?". Recent tool-description research asks a sharper question: does the description actually change which tool an agent picks? The mcptest tool-bench benchmark answers that one.
It is deterministic and model-free, so it runs in CI offline.

    mcptest tool-bench run spec.yml --min-drop 0.3

What it measures
Given a tool catalog and a set of intents (a request plus the tool names that count as a correct pick), a deterministic lexical selector chooses the tool whose name and description best overlap each request. The benchmark reports:
- Selection accuracy with a 95% Wilson confidence interval, so a small intent set does not read as more precise than it is. With one pick per intent, accuracy equals selection F1.
- TDQS mean and min per-tool scores.
- Smell categories: the DESC-NNN lint findings grouped into named smells (underspecified, verbose, uninformative, brittle, missing_examples, missing_return_format, missing_annotations, invalid_annotation).
- Catalog token cost: a whitespace-token estimate of the name and description text.
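
To make two of the reported numbers concrete, here is a minimal Python sketch of a 95% Wilson score interval and a whitespace-token count. The function names are hypothetical, and how mcptest joins the name and description before counting is an assumption; only the Wilson formula itself is standard.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if total == 0:
        return (0.0, 1.0)  # no intents: accuracy is completely undetermined
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

def whitespace_tokens(name: str, description: str) -> int:
    """Token estimate: split name and description on whitespace (assumed joining rule)."""
    return len(f"{name} {description}".split())

# 4 correct picks out of 5 intents: point accuracy is 0.8, but the interval is wide.
low, high = wilson_interval(4, 5)
print(f"accuracy 0.80, 95% CI [{low:.2f}, {high:.2f}]")  # roughly [0.38, 0.96]
```

With only five intents the interval spans more than half the unit range, which is why the report shows it next to the point estimate.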
Then it applies each declared mutation, one at a time, and reports how far selection dropped, which smells the edit introduced, and how many tokens it removed. That is the mutation-backed selection impact: proof that a description change improves or degrades selection, not just its style score.
The selection oracle is a deliberate proxy: a reproducible lower bound, not a claim about a specific model. It reacts to the same lexical signal a retriever or a model's tool router keys on, so degrading the keywords that made a description match an intent measurably moves the pick.
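
The idea of a lexical selector can be sketched in a few lines of Python. This is an illustration, not mcptest's implementation: the tokenizer, the plain overlap count, the tie-break by catalog order, and the second tool `lookup_b` are all assumptions made for the example.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; underscores in tool names act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def select(query: str, tools: list[dict]) -> str:
    """Pick the tool whose name + description shares the most tokens with the query.

    Ties go to the earliest tool in the catalog (max returns the first maximum).
    """
    q = tokens(query)
    best = max(tools, key=lambda t: len(q & tokens(t["name"] + " " + t.get("description", ""))))
    return best["name"]

CATALOG = [
    {"name": "lookup_a",
     "description": "Return the current weather forecast for a city, as JSON."},
    # Hypothetical competitor, added only so there is something to mis-pick.
    {"name": "lookup_b",
     "description": "Return the latest stock price for a ticker symbol, as JSON."},
]
QUERY = "what is the weather forecast in Paris"

# Same catalog with lookup_a's description blanked out.
blanked = [{**t, "description": ""} if t["name"] == "lookup_a" else t for t in CATALOG]

print(select(QUERY, CATALOG))  # lookup_a
print(select(QUERY, blanked))  # lookup_b
```

Because the tool names here carry no signal, removing the words "weather" and "forecast" from the description is enough to flip the pick, which is the effect the benchmark measures.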
The spec

    catalog:                     # inline tools/list, or use catalog_path:
      tools:
        - name: lookup_a
          description: Return the current weather forecast for a city, as JSON.
    intents:
      - query: what is the weather forecast in Paris
        gold: [lookup_a]
    mutations:
      - tool: lookup_a
        kind: blank              # blank | genericize | truncate | drop_keyword

catalog is an inline tools/list result (an object with a tools array, or a bare array of tools). Use catalog_path: to point at a saved snapshot instead. Each mutation names a tool and a kind:

| kind | Parameters | Effect |
|---|---|---|
| blank | | Remove the description entirely. |
| genericize | | Replace it with a contentless phrase. |
| truncate | chars | Keep only the first chars characters. |
| drop_keyword | word | Remove every occurrence of a content word. |

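
The four mutation kinds can be read as small string transforms. The Python sketch below is one plausible reading of the table, not the tool's code: the `mutate` helper and the generic phrase are hypothetical, and matching `drop_keyword` as a case-insensitive whole word is an assumption.

```python
import re
from typing import Optional

GENERIC_PHRASE = "A tool."  # hypothetical stand-in for the contentless phrase

def mutate(description: str, kind: str,
           chars: Optional[int] = None, word: Optional[str] = None) -> str:
    """Return the mutated description for one of the four mutation kinds."""
    if kind == "blank":
        return ""
    if kind == "genericize":
        return GENERIC_PHRASE
    if kind == "truncate":
        if chars is None:
            raise ValueError("truncate needs chars")
        return description[:chars]
    if kind == "drop_keyword":
        if not word:
            raise ValueError("drop_keyword needs word")
        # Assumed matching rule: whole word, any case; then collapse leftover spaces.
        stripped = re.sub(rf"\b{re.escape(word)}\b", "", description, flags=re.IGNORECASE)
        return " ".join(stripped.split())
    raise ValueError(f"unknown mutation kind: {kind}")

DESC = "Return the current weather forecast for a city, as JSON."
print(mutate(DESC, "truncate", chars=6))             # Return
print(mutate(DESC, "drop_keyword", word="weather"))  # Return the current forecast for a city, as JSON.
```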
CLI

    mcptest tool-bench run <spec> [--min-drop FRACTION] [--json]

--min-drop turns the run into a regression gate: it exits non-zero when the largest selection drop across mutations is below the threshold, i.e. the descriptions are not actually carrying the selection signal. --json emits the structured report.
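
The direction of the gate is easy to get backwards: it fails when mutations hurt too little, not too much. A minimal Python sketch of that rule follows, assuming each drop is baseline accuracy minus mutated accuracy and that the failing exit code is 1 (the text only says non-zero); `gate_exit_code` is a hypothetical name.

```python
def gate_exit_code(drops: list[float], min_drop: float) -> int:
    """Return 0 when the largest selection drop reaches min_drop, else 1."""
    largest = max(drops, default=0.0)  # no mutations: treated as no drop (assumption)
    return 0 if largest >= min_drop else 1

# Blanking a description cost 0.5 accuracy: descriptions carry signal, gate passes.
print(gate_exit_code([0.5, 0.1], min_drop=0.3))  # 0
# No mutation moved selection by 0.3 or more: gate fails.
print(gate_exit_code([0.0, 0.1], min_drop=0.3))  # 1
```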
Smells in tool_quality:
The smell categories are also exposed by the tool_quality: check block, so a mcptest run suite can gate on them directly:

    tool_quality:
      - name: catalog smells
        server: my-server
        expect:
          - target: smells.underspecified
            matcher: { schema: { maximum: 0 } }
          - target: smell_total
            matcher: { schema: { maximum: 2 } }

See the runnable tool-description-bench example.