Tool-description quality benchmark

mcptest scores tool-description quality two ways already: the deterministic TDQS heuristics behind the tool_quality: check block, and the DESC-NNN description lints. Both answer "is this description well written?". Recent tool-description research asks a sharper question: does the description actually change which tool an agent picks? The mcptest tool-bench benchmark answers that one.

It is deterministic and model-free, so it runs in CI offline.

mcptest tool-bench run spec.yml --min-drop 0.3

What it measures

Given a tool catalog and a set of intents (each a request, `query`, plus the tool names that count as a correct pick, `gold`), a deterministic lexical selector chooses the tool whose name and description best overlap each request. The benchmark reports:

- Selection accuracy with a 95% Wilson confidence interval, so a small intent set does not read as more precise than it is. With one pick per intent, accuracy equals selection F1.
- TDQS mean and min per-tool scores.
- Smell categories: the DESC-NNN lint findings grouped into named smells (underspecified, verbose, uninformative, brittle, missing_examples, missing_return_format, missing_annotations, invalid_annotation).
- Catalog token cost: a whitespace-token estimate of the name and description text.
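
To show why the Wilson interval keeps a small intent set honest, here is a minimal sketch of the standard Wilson score interval. The function name and the z = 1.96 default are illustrative assumptions, not mcptest's own code.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if total == 0:
        return (0.0, 1.0)  # no intents: the interval is uninformative
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# Same 80% accuracy, very different certainty:
small = wilson_interval(8, 10)     # roughly (0.49, 0.94)
large = wilson_interval(80, 100)   # a much narrower interval around 0.8
print(small, large)
```

With 10 intents, 80% accuracy is compatible with anything from about 49% to 94%, which is the imprecision the interval is meant to surface.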

Then it applies each declared mutation, one at a time, and reports how far selection accuracy dropped, which smells the edit introduced, and how many tokens it removed. That is the mutation-backed selection impact: evidence that a description change improves or degrades selection, not just its style score.

The selection oracle is a deliberate proxy: a reproducible lower bound, not a claim about any specific model. It reacts to the same lexical signal that a retriever or a model's tool router keys on, so degrading the keywords that made a description match an intent measurably moves the pick.
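
As a rough illustration of such a lexical proxy, the sketch below picks the tool whose name and description share the most tokens with the request. The tokenizer, the tie-breaking rule, and the second tool `lookup_b` are assumptions made for this example; mcptest's actual selector may score overlap differently.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase and split on non-alphanumerics; returns the distinct tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def select_tool(query: str, tools: list[dict]) -> str:
    """Return the tool whose name + description overlaps the query most.

    max() keeps the first maximal item, so ties resolve by catalog order.
    """
    query_tokens = tokenize(query)

    def overlap(tool: dict) -> int:
        text = f"{tool['name']} {tool.get('description', '')}"
        return len(query_tokens & tokenize(text))

    return max(tools, key=overlap)["name"]

def accuracy(intents: list[dict], tools: list[dict]) -> float:
    """Fraction of intents whose pick is one of the gold tool names."""
    hits = sum(select_tool(i["query"], tools) in i["gold"] for i in intents)
    return hits / len(intents)

catalog = [
    {"name": "lookup_a", "description": "Return the current weather forecast for a city, as JSON."},
    # lookup_b is a hypothetical competitor added for illustration
    {"name": "lookup_b", "description": "Return the exchange rate between two currencies, as JSON."},
]
intents = [{"query": "what is the weather forecast in Paris", "gold": ["lookup_a"]}]

baseline = accuracy(intents, catalog)
# Blank lookup_a's description: its matching keywords disappear
blanked = [{**catalog[0], "description": ""}, catalog[1]]
mutated = accuracy(intents, blanked)
print(baseline, mutated, baseline - mutated)  # 1.0 0.0 1.0
```

Once the words "weather" and "forecast" are gone, `lookup_b` wins on the single shared token "the", so the pick moves and the drop is measurable.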

The spec

catalog:                      # inline tools/list, or use catalog_path:
  tools:
    - name: lookup_a
      description: Return the current weather forecast for a city, as JSON.
intents:
  - query: what is the weather forecast in Paris
    gold: [lookup_a]
mutations:
  - tool: lookup_a
    kind: blank               # blank | genericize | truncate | drop_keyword

catalog is an inline tools/list result (an object with a tools array, or a bare array of tools). Use catalog_path: to point at a saved snapshot instead. Each mutation names a tool and a kind:

| `kind` | Parameters | Effect |
| --- | --- | --- |
| `blank` | none | Remove the description entirely. |
| `genericize` | none | Replace it with a contentless phrase. |
| `truncate` | `chars` | Keep only the first `chars` characters. |
| `drop_keyword` | `word` | Remove every occurrence of a content word. |
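
The four kinds can be pictured as plain string transforms. The following is a sketch under stated assumptions: the `mutate` helper, the placeholder phrase used for `genericize`, and the whole-word, case-insensitive matching for `drop_keyword` are all illustrative choices, not confirmed mcptest behavior.

```python
import re

# Hypothetical contentless phrase; the real replacement text is not documented here.
GENERIC_PHRASE = "Does something."

def mutate(description: str, kind: str, chars: int = 0, word: str = "") -> str:
    """Apply one mutation kind to a tool description and return the new text."""
    if kind == "blank":
        return ""
    if kind == "genericize":
        return GENERIC_PHRASE
    if kind == "truncate":
        return description[:chars]
    if kind == "drop_keyword":
        # Assumed semantics: whole-word, case-insensitive removal
        stripped = re.sub(rf"\b{re.escape(word)}\b", "", description, flags=re.IGNORECASE)
        return " ".join(stripped.split())  # collapse the leftover double spaces
    raise ValueError(f"unknown mutation kind: {kind}")

original = "Return the current weather forecast for a city, as JSON."
print(mutate(original, "truncate", chars=6))
print(mutate(original, "drop_keyword", word="weather"))
```

Each mutation leaves the tool name untouched, so any change in the pick comes from the description alone.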

CLI

mcptest tool-bench run <spec> [--min-drop FRACTION] [--json]

--min-drop turns the run into a regression gate: it exits non-zero when the largest selection drop across mutations is below the threshold, i.e. the descriptions are not actually carrying the selection signal. --json emits the structured report.
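
The gate condition can be sketched as follows. This assumes the drop is the absolute difference between baseline and mutated accuracy, and uses 1 as the failing exit code; the source only says the exit is non-zero, and the function name is hypothetical.

```python
def gate_exit_code(baseline: float, mutated: list[float], min_drop: float) -> int:
    """Return 0 when some mutation drops accuracy by at least min_drop, else 1."""
    # One drop per mutation; an empty mutation list counts as no drop at all
    largest_drop = max((baseline - acc for acc in mutated), default=0.0)
    return 1 if largest_drop < min_drop else 0

# Blanking a description halves accuracy: the descriptions carry signal, gate passes
print(gate_exit_code(1.0, [0.5, 0.9], min_drop=0.3))  # 0
# No mutation moves accuracy much: the gate fails
print(gate_exit_code(1.0, [0.9, 1.0], min_drop=0.3))  # 1
```

Note the direction of the check: a large drop is the passing case, because it shows that selection depends on the description text.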

Smells in `tool_quality:`

The smell categories are also exposed by the tool_quality: check block, so an mcptest run suite can gate on them directly:

tool_quality:
  - name: catalog smells
    server: my-server
    expect:
      - target: smells.underspecified
        matcher: { schema: { maximum: 0 } }
      - target: smell_total
        matcher: { schema: { maximum: 2 } }

See the runnable tool-description-bench example.
