Tool-selection and surface metrics

These are the deterministic, model-free gates that score how an agent picks tools and how much the tool surface costs it. Every metric here reads a recorded tool-call trace (or, where noted, the live tool catalog) and produces a number you gate on in CI. The same trace always yields the same number, so a green gate stays reproducible and free.

Each section follows the same shape: what the metric measures, the YAML block that turns it on under an agent test, and how to read the result. For the companion reliability and trace gates (trajectory, golden path, stability, fault recovery, narrative-vs-trace) see Reliability and trace metrics. For when to reach for a deterministic gate versus a model-graded judge, see Evaluation: judges, juries, and rubrics.

Selection F1 (`equal_function_sets:`)

What it measures. Whether an agent picked the right tool in a multi-server scenario, scored against capability classes rather than exact tool ids. An equal-function set is a named group of tools that accomplish the same job; any member of the set is an acceptable choice. If three servers each expose a web-search tool, those three form one class, and selecting any one counts as correct. The gate reports precision, recall, and F1, so a miss (a needed capability never reached) and an extra (an unneeded tool called) are both visible. This is the idea from MSC-Bench (arXiv:2510.19423).

Because the check is "did an observed call land in the right class," it reads the trace alone: no judge model, no token cost, no cross-provider drift.

The counting rule. The implementation walks the observed calls against the declared classes and applies one rule. Each class is consumed at most once.

A class is a true positive the first time an observed call names a member of that class.
A class with no observed member is a false negative.
An observed call that matches no class is a false positive.
A repeat hit on an already-matched class is neither a true positive nor a false positive: repeated interchangeable members neither inflate nor penalize the score, so calling two members of the same class still counts as one true positive.

A member id with a `server.` prefix matches the full qualified id, so two servers that expose an identically named tool stay distinct. A bare tool id (no `server.` prefix) matches that tool on any server, which is convenient for a single-server scenario.

Precision is true positives over (true positives + false positives): of the tools the agent chose, how many were on target. Recall is true positives over (true positives + false negatives): of the capabilities the scenario required, how many the agent reached. F1 is their harmonic mean. All three are reported as integers from 0 to 100. No declared classes and no observed calls scores 100 (vacuously perfect); a zero denominator yields 0 rather than a not-a-number value.
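The counting rule and the three percentages can be sketched in a few lines of Python. This is an illustration of the rule as described above, not the mcptest implementation: the function names, the round-half-up integer rounding, and the choice to split a qualified id on its first dot are all assumptions.

```python
from typing import Dict, List, Set, Tuple


def pct(num: int, den: int) -> int:
    """Integer percent, rounded half up (assumed); a zero denominator yields 0."""
    return 0 if den == 0 else (200 * num + den) // (2 * den)


def matches(member: str, call: str) -> bool:
    """A qualified member (server.tool) must match exactly; a bare member
    matches that tool name on any server."""
    if "." in member:
        return member == call
    return call.split(".", 1)[-1] == member


def selection_counts(classes: Dict[str, List[str]], calls: List[str]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives)."""
    matched: Set[str] = set()
    fp = 0
    for call in calls:
        hit = next(
            (name for name, members in classes.items()
             if any(matches(m, call) for m in members)),
            None,
        )
        if hit is None:
            fp += 1           # the call lands in no class
        else:
            matched.add(hit)  # each class is consumed at most once
    tp = len(matched)
    fn = len(classes) - tp    # classes with no observed member
    return tp, fp, fn


def selection_scores(tp: int, fp: int, fn: int) -> Dict[str, int]:
    if tp == fp == fn == 0:   # no declared classes and no observed calls
        return {"precision": 100, "recall": 100, "f1": 100}
    return {
        "precision": pct(tp, tp + fp),
        "recall": pct(tp, tp + fn),
        "f1": pct(2 * tp, 2 * tp + fp + fn),  # harmonic mean, from the counts
    }
```

Computing F1 directly from the counts as 2TP / (2TP + FP + FN) is algebraically the harmonic mean of precision and recall, and it avoids rounding the two intermediate percentages first.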

The YAML. Declare the gate inside an agent test with an equal_function_sets: block: a required classes: list and an optional expect: list.

tests:
  - name: research-agent picks search then fetch
    type: agent
    agent: researcher
    runs: 1
    equal_function_sets:
      classes:
        - name: search
          members:
            - brave.web_search
            - google.search
        - name: fetch
          members:
            - http.get
      expect:
        - target: tool_selection.f1
          matcher:
            schema: { minimum: 80 }

When the expect: list is omitted or empty, the block applies one default: tool_selection.f1 >= 50 (a selection F1 under half is more wrong than right). The three assertable targets are tool_selection.f1, tool_selection.precision, and tool_selection.recall, each an integer percent gated with the standard matchers.

How to read it. A run that calls brave.web_search and http.get against the two classes above scores 2 true positives, 0 false positives, 0 false negatives: precision 100, recall 100, F1 100. A run that calls google.search and shell.exec (which is in no class) scores 1 true positive, 1 false positive (the shell.exec call), and 1 false negative (the fetch class never satisfied): precision 50, recall 50, F1 50. The report names fetch as the missed class and shell.exec as the unexpected tool, so you see exactly which capability the agent skipped and which extra tool it reached for.

When a test runs more than once (runs: above 1), the gate micro-averages across runs: it sums true positives, false positives, and false negatives over every run, then derives the three percentages from those totals. The sums are order-independent, so a fixed set of run traces always produces the same aggregate.
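A minimal sketch of the micro-average, assuming each run has already been reduced to a (true positives, false positives, false negatives) triple. The helper names and the round-half-up rounding are assumptions made for the illustration.

```python
from typing import Dict, Iterable, Tuple


def pct(num: int, den: int) -> int:
    """Integer percent, rounded half up (assumed); a zero denominator yields 0."""
    return 0 if den == 0 else (200 * num + den) // (2 * den)


def micro_average(runs: Iterable[Tuple[int, int, int]]) -> Dict[str, int]:
    """Sum the per-run counts first, then derive the percentages once."""
    tp = fp = fn = 0
    for run_tp, run_fp, run_fn in runs:
        tp += run_tp
        fp += run_fp
        fn += run_fn
    if tp == fp == fn == 0:  # nothing declared and nothing observed in any run
        return {"precision": 100, "recall": 100, "f1": 100}
    return {
        "precision": pct(tp, tp + fp),
        "recall": pct(tp, tp + fn),
        "f1": pct(2 * tp, 2 * tp + fp + fn),
    }
```

For one perfect run `(2, 0, 0)` and one run `(0, 1, 2)` that missed both classes, the pooled counts are 2, 1, and 2, which gives precision 67, recall 50, and F1 57. Averaging the two per-run F1 values (100 and 0) would give 50 instead, which is why the pooling order matters for the definition but not for reproducibility.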

Pass^k tool selection (`tool_selection:`)

What it measures. How often a model reaches for the expected tool across N runs, and how many tokens it burns doing so. A single agent run is a coin flip: a model that picks the right tool once might have been lucky. Run the same prompt ten times and the rate, not the single outcome, tells you whether the integration is reliable. The tool_selection: floor asserts both halves at once: a minimum selection rate AND an optional ceiling on per-run tokens.

Run more than once. Add runs: N to an agent test to dispatch the same (agent, model) pair N times. runs: defaults to 1 (legacy single-run behavior). Values above 1 produce one reporter row per run, named weather selection #1 through #N. When the test also fans across a model matrix, the run index is appended to the per-model row, for example weather selection [gpt-5] #3. A runs: 0 is clamped up to 1.

The YAML.

agents:
  - name: weather selection
    model: claude-sonnet-4-5
    servers: [weather]
    prompt: What is the weather in Sacramento?
    runs: 10
    tool_selection:
      expected_tool: get_weather
      min_selection_rate: 0.8
      max_total_tokens: 2000

Field	Required	Meaning
`expected_tool`	yes	The tool the model is supposed to call. A run "selects" it when the name appears in any `tool_calls[].name`.
`min_selection_rate`	yes	The minimum fraction of runs (0.0 to 1.0) that must select the tool.
`max_total_tokens`	no	A per-run cap on `conversation.tokens.total`. Omit it for no token budget.

The floor is independent of the per-run expect: block: expect: asserts the shape of any one run, tool_selection: asserts the behavior across all of them.

How to read it. The aggregation reports three numbers per (agent, model) pair:

selection rate: selected / runs. How often the model reached for the expected tool, ignoring cost.
pass^k: the fraction of runs that BOTH selected the tool AND stayed within the token budget. This is the headline metric. A run that picks the right tool but blows the budget does not count, because correctness bought with an unbounded budget is not correctness you can rely on.
tokens (median / max): the median and the largest conversation.tokens.total across the runs. The pair tells one expensive outlier (median well below max) apart from uniformly heavy spend (median near max).

The CLI prints one cost-vs-accuracy line per pair:

tool-selection floor [PASS] weather selection: selection 9/10 (90%), pass^k 90%, tokens 1520 median / 1840 max

The floor passes when the selection rate meets min_selection_rate AND every run stayed within max_total_tokens (when set). The budget check is all-or-nothing: one run over the cap fails the floor even at a 100% selection rate. When a floor is not met, mcptest run names the condition that failed and lists the individual runs that missed:

FLOOR weather selection: selection rate 60% is below the 80% floor (6 of 10 runs selected `get_weather`)
FLOOR weather selection: 3 of 10 runs exceeded the 2000-token budget (worst run 3120 tokens)
  run 4: 2400 tokens, over budget
  run 7: did not select `get_weather`, called search
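The aggregation and the floor rule can be sketched as follows. The `Run` record and the `aggregate` function are hypothetical names for this illustration, and treating a run that lands exactly on the cap as within budget is an assumption.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional


@dataclass
class Run:
    tool_calls: List[str]  # the names from tool_calls[].name
    total_tokens: int      # conversation.tokens.total


def aggregate(
    runs: List[Run],
    expected_tool: str,
    min_selection_rate: float,
    max_total_tokens: Optional[int] = None,
) -> Dict[str, object]:
    n = len(runs)  # at least 1, since runs: is clamped up to 1
    selected = [expected_tool in r.tool_calls for r in runs]
    in_budget = [
        max_total_tokens is None or r.total_tokens <= max_total_tokens
        for r in runs
    ]
    tokens = [r.total_tokens for r in runs]
    selection_rate = sum(selected) / n
    return {
        "selection_rate": selection_rate,
        # pass^k: runs that selected the tool AND stayed within budget
        "pass_k": sum(s and b for s, b in zip(selected, in_budget)) / n,
        "tokens_median": median(tokens),
        "tokens_max": max(tokens),
        # the budget half is all-or-nothing across runs
        "floor_passed": selection_rate >= min_selection_rate and all(in_budget),
    }
```

Note the asymmetry the sketch makes visible: pass^k degrades gradually as runs go over budget, while the floor fails outright on the first over-budget run.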

mcptest run exits with code 1 on a breach, the same code a failing test returns. The progress line goes to stderr, so --reporter json stays clean. The selection rate, pass^k, token distribution, and per-run detail land in the JSON report under tool_selection. To see how a model drifted between two runs, diff the saved reports with mcptest report current.json --compare baseline.json, which prints a per-case delta and flags a floor flip.

One limitation: multi-run is a live-dispatch feature today. When a recorded cassette is replayed (mcptest run with a cassette on disk and no --record), the agent test runs once regardless of runs:. For now, use the floor against live models. A runnable example is in examples/agent-multi-run-tool-selection.yml.

Distractor accuracy (`distractors:`)

What it measures. Whether the agent still selects correctly when the candidate list is padded with N irrelevant or near-duplicate tools. A real deployment rarely presents exactly the tools a task needs, and a model that is perfect on a clean list can fall apart when the list is crowded. The scoring is objective and runs no model: a choice is correct when its id is in the declared correct set, and a distractor when its id is one of the injected distractors. This is the tool-overload setup from MCPAgentBench (arXiv:2512.24565) and MCP-Atlas (arXiv:2602.00933).

Two distractor sources. The distractors: block declares how many distractors to inject and where they come from:

from: catalog draws from a bundled catalog of plausible-but-irrelevant tools (weather, currency, calendar, and such).
from: near_duplicate synthesizes name-collision variants of the real tools named in of: (a _v2 or _internal variant, casing, pluralization). of: is required for this source.

Setting count: 0 is the no-noise baseline; raising it injects more look-alikes, and a robust agent holds its accuracy as the count climbs. Running the same scenario at several counts traces the accuracy-vs-distractor-count curve the benchmarks report. An optional complexity: tag (serial or parallel) stratifies the reporting by invocation complexity; it is descriptive metadata and does not change the accuracy math.

The accuracy rule. Per chosen id, the scorer counts a correct hit when the id is in the correct set, a distractor hit when it is in the distractor set, and ignores everything else. Accuracy is the integer percent chose_correct * 100 / chose_in_scope where chose_in_scope = chose_correct + chose_distractor. A run that chose nothing in scope scores 0, except the vacuous case of zero declared correct ids, which scores 100.
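The accuracy rule is small enough to write out. A minimal sketch, assuming the integer percent is floor division and that the vacuous 100 applies only when nothing in scope was chosen; the function name is hypothetical.

```python
from typing import Dict, Iterable, Set


def distractor_accuracy(
    chosen: Iterable[str], correct: Set[str], distractors: Set[str]
) -> Dict[str, int]:
    chosen = list(chosen)
    chose_correct = sum(1 for tool in chosen if tool in correct)
    chose_distractor = sum(1 for tool in chosen if tool in distractors)
    chose_in_scope = chose_correct + chose_distractor  # everything else is ignored
    if chose_in_scope == 0:
        # Nothing in scope scores 0, unless no correct ids were declared at all.
        accuracy = 100 if not correct else 0
    else:
        accuracy = chose_correct * 100 // chose_in_scope
    return {"accuracy": accuracy, "chose_distractor": chose_distractor}
```

A run that picks one correct tool, one injected distractor, and one unrelated tool scores 50: the unrelated tool is in neither set, so it does not enter the denominator.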

The YAML.

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json
servers:
  catalog:
    command: ["mcptest", "mock", "--tools-from", "./servers/catalog.yml"]

agents:
  - name: resists bundled distractors
    model: claude-sonnet-4-5
    servers: [catalog]
    runs: 4
    prompt: Find products that match the keyword "notebook".
    distractors:
      count: 4
      source:
        from: catalog
      correct: [catalog.search_products]
      complexity: serial
      expect:
        - target: distractors.accuracy
          matcher:
            schema: { minimum: 80 }
        - target: distractors.chose_distractor
          matcher:
            schema: { maximum: 0 }

A near-duplicate scenario derives look-alikes from the real tool names:

    distractors:
      count: 3
      source:
        from: near_duplicate
        of: [search_products, get_product]
      correct: [catalog.search_products, catalog.get_product]
      complexity: parallel

The block exposes two assertable targets: distractors.accuracy (the selection accuracy percent) and distractors.chose_distractor (how many injected distractors were chosen, the headline failure signal). An empty or omitted expect: applies the default gate distractors.accuracy >= 50. The example targets a deterministic mcptest mock server, so the run is offline; a runnable version is in examples/distractor-tools/.

The certified accuracy floor (distractors.certified_lower). A single pass rate is a weak predictor under distraction: a model that scored 80 percent over five runs may still collapse below 30 percent on the next batch. The gate also exposes distractors.certified_lower, the Clopper-Pearson exact lower bound on the per-run success rate at 95 percent confidence (a run succeeds when the agent selected only correct tools). The bound is the floor the true accuracy clears at that confidence given the runs observed, so a suite can gate on the guarantee rather than the lucky average:

distractors:
  count: 8
  source: { from: catalog }
  correct: [google.search]
  expect:
    - target: distractors.certified_lower
      matcher: { schema: { minimum: 20 } }

The exact (Clopper-Pearson) bound is used rather than the normal-approximation (Wald) interval because Wald is badly miscalibrated at the extremes, exactly where distractor saturation pushes accuracy. The cost is more runs: a tight certified floor needs a real sample (one perfect run certifies only about 5 percent at 95 percent confidence), so raise runs: before asserting a high floor. Reach for certified_lower on a release or safety claim; the plain distractors.accuracy floor is enough for a quick regression check. The method follows LLMCert-T (arXiv:2510.03992), which shows clean accuracy of 71 to 77 percent collapsing to a certified bound near 0.20 under distractor saturation.
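To make the bound concrete, here is a standard-library sketch of a Clopper-Pearson lower bound, found by bisecting the exact binomial tail. It assumes a one-sided bound at the stated confidence, which is consistent with the "one perfect run certifies about 5 percent" figure above; whether mcptest uses a one-sided or two-sided interval is not stated here, and the function names are hypothetical.

```python
from math import comb


def upper_tail(successes: int, runs: int, p: float) -> float:
    """P(X >= successes) for X ~ Binomial(runs, p)."""
    return sum(
        comb(runs, k) * p**k * (1 - p) ** (runs - k)
        for k in range(successes, runs + 1)
    )


def certified_lower(successes: int, runs: int, confidence: float = 0.95) -> float:
    """One-sided exact lower bound on the per-run success rate (a fraction in 0..1)."""
    if runs <= 0 or not 0 <= successes <= runs:
        raise ValueError("need runs >= 1 and 0 <= successes <= runs")
    if successes == 0:
        return 0.0
    alpha = 1.0 - confidence
    lo, hi = 0.0, 1.0
    # The tail probability increases with p, so bisect for tail(p) == alpha.
    for _ in range(60):
        mid = (lo + hi) / 2
        if upper_tail(successes, runs, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return lo
```

With every run succeeding the bound has the closed form alpha^(1/n): one perfect run certifies 0.05, and five perfect runs certify about 0.55, which shows how quickly raising runs: tightens the floor.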

The distractors: block is marked x-mcptest-status: preview in the schema: the engine scores a recorded selection today, and the runtime injection of distractors into the presented tool list is preview-stage runner wiring.

Name-free discovery (`discovery:` and `orchestration:`)

What it measures. Whether an agent can find the right tool from intent alone, with no tool and no server named in the prompt. Most agent tests name the tool the model should call, then check that it called it, which measures execution, not discovery. A name-free scenario removes the name: the prompt states only the user's intent, and the agent has to find the path itself. This is the discovery axis from MCP-Atlas (arXiv:2602.00933).

Declaring an intent-only scenario. Two blocks make a scenario name-free:

discovery.name_free: true is the author's declaration that the prompt names no tools and no servers. It is a promise about the prompt, not a transform: it signals that the discovered tool path is judged against equal-function classes rather than against a named expected tool. An absent name_free: defaults to false.
equal_function_sets: groups the interchangeable tools into named classes, so any member of a class counts as reaching that capability (see Selection F1 for the precision, recall, and F1 rules). Because no single tool is named, the gate floors tool_selection.recall rather than asserting a specific tool id.

The five orchestration diagnostics. A run can reach the right capability yet waste calls, call the right tool with malformed arguments, or fail once and never recover. The orchestration: block folds the recorded trace (plus the declared classes, for the discovery axis) into five assertable targets, each an integer percent computed with no model in the loop:

orchestration.discovery: did the agent reach the right capability class. Reuses the equal-function-set recall: of the declared expected classes, the percent that at least one trace call hit.
orchestration.parameterization: of all calls, the percent whose args are a non-empty JSON object. Argument schemas are not available offline, so this measures argument presence, not schema validity. An empty trace scores 100.
orchestration.syntax: of all calls, the percent that are structurally well formed (a non-empty tool-name string paired with JSON object args). An empty trace scores 100.
orchestration.error_recovery: of the calls that returned an error, the percent followed later by a successful call to the same tool or an equivalent in the same class. No errors in the trace scores 100.
orchestration.efficiency: the ratio of the optimal path length (the number of declared expected classes) to the observed call count, capped at 100. Calling exactly one tool per expected class scores 100; extra calls lower it. No expected classes scores 0.
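The five diagnostics can be sketched over a simple trace. This is an illustration under stated assumptions, not the engine: each call is assumed to be a dict with hypothetical `tool`, `args`, and `error` keys, class membership is matched on exact ids, percentages round half up, and the scores for cases the text does not cover (discovery with no declared classes, efficiency on an empty trace) are guesses marked in the comments.

```python
from typing import Any, Dict, List, Optional


def pct(num: int, den: int, empty: int = 100) -> int:
    """Integer percent, rounded half up; `empty` is returned when den is 0."""
    return empty if den == 0 else (200 * num + den) // (2 * den)


def class_of(tool: Any, classes: Dict[str, List[str]]) -> Optional[str]:
    """Name of the class that lists `tool` as a member, or None."""
    for name, members in classes.items():
        if tool in members:
            return name
    return None


def orchestration(trace: List[Dict[str, Any]], classes: Dict[str, List[str]]) -> Dict[str, int]:
    n = len(trace)
    reached = {class_of(c.get("tool"), classes) for c in trace} - {None}
    has_args = sum(1 for c in trace if isinstance(c.get("args"), dict) and c["args"])
    well_formed = sum(
        1 for c in trace
        if isinstance(c.get("tool"), str) and c["tool"] and isinstance(c.get("args"), dict)
    )

    errors = recovered = 0
    for i, call in enumerate(trace):
        if not call.get("error"):
            continue
        errors += 1
        cls = class_of(call.get("tool"), classes)
        for later in trace[i + 1:]:
            same_tool = later.get("tool") == call.get("tool")
            same_class = cls is not None and class_of(later.get("tool"), classes) == cls
            if not later.get("error") and (same_tool or same_class):
                recovered += 1
                break

    return {
        "discovery": pct(len(reached), len(classes)),  # 100 with no classes: assumed
        "parameterization": pct(has_args, n),          # empty trace scores 100
        "syntax": pct(well_formed, n),                 # empty trace scores 100
        "error_recovery": pct(recovered, errors),      # no errors scores 100
        # No expected classes scores 0; an empty trace scoring 0 is assumed.
        "efficiency": min(100, pct(len(classes), n, empty=0)) if classes else 0,
    }
```

A call with empty `{}` args lowers parameterization but not syntax, which is the distinction between the two targets: one checks that arguments are present, the other only that the call is structurally well formed.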

The YAML. The prompt names no tool and no server; two mock servers each expose two interchangeable tools, grouped into a search class and a fetch class.

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json
servers:
  catalog:
    command: ["mcptest", "mock", "--tools-from", "./servers/catalog.yml"]
  fulfillment:
    command: ["mcptest", "mock", "--tools-from", "./servers/fulfillment.yml"]

agents:
  - name: intent-only discovery across two servers
    model: claude-sonnet-4-5
    servers: [catalog, fulfillment]
    runs: 5
    prompt: >
      Find the current population of Paris and pull the source document for it.
    discovery:
      name_free: true
    equal_function_sets:
      classes:
        - name: search
          members:
            - catalog.search
            - catalog.web_search
        - name: fetch
          members:
            - fulfillment.fetch
            - fulfillment.get
      expect:
        - target: tool_selection.recall
          matcher:
            schema: { minimum: 100 }
    orchestration:
      expect:
        - target: orchestration.discovery
          matcher:
            schema: { minimum: 100 }
        - target: orchestration.syntax
          matcher:
            schema: { minimum: 100 }
        - target: orchestration.error_recovery
          matcher:
            schema: { minimum: 100 }
        - target: orchestration.efficiency
          matcher:
            schema: { minimum: 50 }

How to read it. The equal-function-set recall and all five orchestration diagnostics are computed from the recorded trace and the declared classes, so they are byte-stable across platforms and providers. A name-free scenario that reaches both declared classes scores orchestration.discovery: 100; one that reaches one of two scores 50. An optimal path of two calls (one per class) over three observed calls gives 2/3, which rounds to an efficiency of 67. The example targets deterministic mcptest mock servers; a runnable version is in examples/name-free-discovery/.

The discovery: and orchestration: blocks are marked preview in the schema: the diagnostics engine computes the sub-scores today, and the runner wiring that emits them per scenario is still landing.

Token efficiency (`token_efficiency:`)

What it measures. The per-correct price of the tool catalog, in tokens and dollars, alongside the selection F1. A large tool surface is not free: every tool you advertise (its name, description, and input schema) costs input tokens on every model call, and a wide catalog can drown the model in near-duplicate distractors. "Did the agent pick the right tool" is only half the question; the other half is "how much did each correct selection cost." The F1 alone hides a fat catalog that scores well while burning tokens.

The gate scores the same equal-function selection F1 as equal_function_sets:, then folds in two cost signals:

Tool-surface tokens. The merged catalog the model actually saw, tokenized the way mcptest doctor counts a catalog (each tool's name, description, and inputSchema JSON, with the cl100k_base BPE tokenizer). This is the per-call input-token price of the surface.
Run cost. The dollars the run spent, summed across the cell's runs (real provider runs only).

The YAML. The block reuses the equal_function_sets: shape: a required non-empty classes: list and an optional expect: over the token_efficiency.* targets. An empty or omitted expect: applies the default token_efficiency.f1 >= 50.

agents:
  - name: search stays efficient on a wide catalog
    model: claude-sonnet-4-5
    servers: [search-server]
    runs: 3
    prompt: >
      Find the document about quarterly revenue and read it back to me.
    token_efficiency:
      classes:
        - name: search
          members:
            - search
        - name: fetch
          members:
            - fetch
      expect:
        # Selection must not collapse on the padded catalog.
        - target: token_efficiency.f1
          matcher:
            schema: { minimum: 60 }
        # Each correct selection costs no more than 1500 tool-surface tokens.
        - target: token_efficiency.tokens_per_correct
          matcher:
            schema: { maximum: 1500 }

How to read it. The assertable targets:

Target	Meaning
`token_efficiency.f1`	Selection F1, integer percent `0..=100`.
`token_efficiency.tool_surface_tokens`	Tokens describing the tool surface.
`token_efficiency.correct_selections`	True positives pooled across runs.
`token_efficiency.cost`	Dollars spent on the run.
`token_efficiency.tokens_per_correct`	Tool-surface tokens per correct selection.
`token_efficiency.cost_per_correct`	Dollars per correct selection.

A letter grade rides alongside, keyed on the F1 (A at 90, B at 80, C at 70, D at 60, F below). The grade is a human-readable signal printed with the result; gate on token_efficiency.f1 instead. The per-correct figures are undefined when nothing was selected correctly, and cost_per_correct is undefined on a free run; an undefined figure resolves to absent rather than dividing by zero.
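A short sketch of the derived figures, assuming the per-correct values are plain ratios over the pooled correct selections and that "absent" maps to `None`. The function names are hypothetical.

```python
from typing import Optional


def grade(f1: int) -> str:
    """Letter grade keyed on the F1 percent: A at 90, B at 80, C at 70, D at 60."""
    for floor, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if f1 >= floor:
            return letter
    return "F"


def tokens_per_correct(tool_surface_tokens: int, correct_selections: int) -> Optional[float]:
    """Absent (None) when nothing was selected correctly."""
    if correct_selections <= 0:
        return None
    return tool_surface_tokens / correct_selections


def cost_per_correct(cost: float, correct_selections: int) -> Optional[float]:
    """Absent (None) when nothing was selected correctly or the run was free."""
    if correct_selections <= 0 or cost <= 0:
        return None
    return cost / correct_selections
```

For a 4200-token surface and six correct selections pooled across runs, `tokens_per_correct(4200, 6)` is 700.0, which would clear the 1500 ceiling in the example above.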

Reach for token_efficiency: when you are tuning a tool catalog and want to keep accuracy from masking bloat: trimming verbose descriptions, removing near-duplicate tools, or splitting a monolithic server should move tokens_per_correct down without moving the F1. The gate needs the live tool surface to count its tokens, so it fires only on a live run: a cassette replay carries no catalog token count. A runnable suite is in examples/token-efficiency/.

Tool-description quality (`tool_quality:` and the doctor lint)

What it measures. Whether a server's tool descriptions are well written, on two surfaces. Agents pay tokens for every tool description they read, and bad descriptions silently degrade tool-selection accuracy (Hasan et al., arXiv:2602.14878; arXiv:2602.18914). The tool_quality: block gates a server's catalog in a test suite, with one PASS/FAIL row and a non-zero exit when the bar is missed. The mcptest doctor --lint-descriptions lint emits per-tool findings for an interactive read. They share the same heuristics.

Gate on it: the tool_quality: block. A top-level tool_quality: list connects to a server, scores every description with the deterministic TDQS heuristics, and gates on the result:

servers:
  filesystem:
    command: ["npx", "-y", "@modelcontextprotocol/server-filesystem", "/tmp"]

tool_quality:
  - name: "tool descriptions meet the quality bar"
    server: filesystem
    expect:                          # optional; defaults apply if omitted
      - target: min_score
        matcher: { schema: { minimum: 0.50 } }
      - target: mean_score
        matcher: { schema: { minimum: 0.70 } }
      # fail on any research-backed critical finding (the default, made explicit)
      - target: critical_count
        matcher: { schema: { maximum: 0 } }
      # power user: gate one tool by name
      - target: tool["read_file"].score
        matcher: { schema: { minimum: 0.60 } }

The targets:

min_score: the worst tool's score, 0..1. One badly written tool drags the catalog, so this is the "no tool falls below the floor" check.
mean_score: the average tool score, 0..1.
tool["<name>"].score: one named tool's score, for gating a critical tool harder than the catalog average.
critical_count: how many DESC-NNN findings landed at Critical severity. One bad tool can carry a critical while the averaged scores still look fine, so this is the "no tool trips a research-backed critical rule" check.
warning_count: how many DESC-NNN findings landed at Warning severity. Reported for tuning; not in the default gate.
schema_criticals / schema_warnings: strict input-schema findings (SCH-NNN), for example a property with no type or enum. Neither schema target is in the default gate; declare them in expect: to opt in. See tool-schema-lint.md.

Omit expect: and the engine applies the defaults: min_score >= 0.50, mean_score >= 0.70, and critical_count <= 0. critical_count and mean_score watch different things: the scores are an averaged 0..1 signal over six heuristics, so a single tool that duplicates its name (DESC-003) or ships a one-word description (DESC-001) can still leave the mean above 0.70. The lint count catches that tool by severity rather than by average. The scoring is deterministic (description presence and length, parameter documentation, conciseness, return-format and annotation signals). The worked suite is examples/tool-quality.yml.
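The default gate reduces to three comparisons. A minimal sketch, assuming the per-tool scores and the critical finding count are already computed; the function name and the sample scores are made up for the illustration.

```python
from statistics import mean
from typing import Dict


def default_gate(scores: Dict[str, float], critical_count: int) -> Dict[str, object]:
    """Apply the documented defaults to a non-empty map of per-tool scores."""
    values = list(scores.values())
    checks = {
        "min_score": min(values) >= 0.50,
        "mean_score": mean(values) >= 0.70,
        "critical_count": critical_count <= 0,
    }
    return {
        "min_score": min(values),
        "mean_score": mean(values),
        "checks": checks,
        "passed": all(checks.values()),
    }
```

With scores of 0.9, 0.9, 0.9, and 0.55 the minimum (0.55) and the mean (0.8125) both clear their floors, so only a critical finding on the weak tool fails the gate. That is the case `critical_count` exists to catch.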

The lint: mcptest doctor --lint-descriptions. For an interactive read rather than a gate, the doctor lint runs thirteen rules (DESC-001 through DESC-013) against the catalog returned by tools/list and emits findings. Each finding has a severity (Pass, Warning, Critical), a stable rule ID, a message, and an optional suggestion.

Rule ID	What it catches
DESC-001	Description is empty or under 20 chars
DESC-002	Description over 500 chars (probably missing inputSchema constraints)
DESC-003	Description equals the tool name (no signal)
DESC-004	Description contains no common verb (agent cannot categorize)
DESC-005	Description uses positional phrases ("see above", "previous tool") that mean nothing in a flat catalog
DESC-006	A required argument has no description
DESC-007	An enum-typed argument has a description but does not mention the enum values
DESC-008	An argument description is longer than the tool description (inverted information density)
DESC-009	A tool with a non-trivial input schema provides no usage examples
DESC-010	The description states an action but never says what the tool returns
DESC-011	An annotation hint (readOnlyHint, destructiveHint, idempotentHint, openWorldHint) is present with a non-boolean value
DESC-012	A tool declares no annotations object at all
DESC-013	A string argument lists its allowed values in the description but declares no `enum`

Severities are Critical (violates a research-backed quality rule; the agent will likely struggle to use the tool reliably), Warning (imperfect but workable), and Pass (meets the bar for that rule). A clean tool with no rule firing gets one synthetic Pass finding so the report represents every checked tool. A few rules worth knowing:

DESC-009 fires (Warning) when a tool has a non-trivial input schema (more than one property, or a single property that is required or not a plain string) yet ships no examples. Anthropic's advanced-tool-use guidance reports that adding worked examples raised accuracy from 72% to 90% on complex parameter handling. The rule treats a tool as documented if it carries an examples array, or any parameter declares examples, a singular example, or a default.
DESC-010 fires (Warning) when a non-empty description mentions no return/output/result keyword and the tool declares no outputSchema. Good descriptions say what comes back, for example "Returns a list of order objects, each with id, total, and status." The heuristic is coarse (it can tell whether a description gestures at a return shape at all, not whether the shape is accurate), which is why it stays at Warning.
DESC-011 fires (Warning) when one of the four annotation hints is present but holds a non-boolean value. A conformance check, not a heuristic: it inspects only the JSON type.
DESC-012 fires (Warning) when a tool declares no annotations object at all, so a client cannot tell whether a call is read-only, destructive, or idempotent. An annotations object that is present, even empty, passes.
DESC-013 fires (Warning) on a string (or untyped) argument whose schema declares no enum but whose description spells out a fixed set of allowed values ("the new status, one of open, closed, or pending"). The fix is to add the enum. It is the inverse of DESC-007.
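Several of the rules are simple structural checks, which a short sketch can make concrete. This covers only DESC-001, 002, 003, 006, 011, and 012, returns rule IDs without severities, and is not the doctor's implementation: trimming whitespace before measuring length and comparing the description to the name exactly are assumptions.

```python
from typing import Any, Dict, List

HINTS = ("readOnlyHint", "destructiveHint", "idempotentHint", "openWorldHint")


def lint_tool(tool: Dict[str, Any]) -> List[str]:
    """Return the rule IDs a tools/list entry trips, for a subset of the rules."""
    findings: List[str] = []
    name = tool.get("name", "")
    desc = (tool.get("description") or "").strip()

    if len(desc) < 20:
        findings.append("DESC-001")  # empty or under 20 chars
    if len(desc) > 500:
        findings.append("DESC-002")  # over 500 chars
    if desc and desc == name:
        findings.append("DESC-003")  # description equals the tool name

    schema = tool.get("inputSchema") or {}
    props = schema.get("properties") or {}
    for arg in schema.get("required") or []:
        if not (props.get(arg) or {}).get("description"):
            findings.append("DESC-006")  # a required argument has no description
            break

    annotations = tool.get("annotations")
    if isinstance(annotations, dict):
        if any(h in annotations and not isinstance(annotations[h], bool) for h in HINTS):
            findings.append("DESC-011")  # a hint holds a non-boolean value
    else:
        findings.append("DESC-012")      # no annotations object at all
    return findings
```

An empty `annotations: {}` object passes DESC-012 here, matching the rule as described: the check is for the presence of the object, not for any particular hint.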

Use tool_quality: in a suite to gate the build; use mcptest doctor --lint-descriptions interactively to see which rule each tool tripped. Fix the findings the lint reports, then watch the tool_quality: scores climb. The research bibliography is in research-references.md.

Tool-description benchmark (`mcptest tool-bench`)

What it measures. Whether a tool description actually changes which tool an agent picks, which is a sharper question than "is this description well written?" Given a tool catalog and a set of intents (a request plus the tool names that count as a correct pick), a deterministic lexical selector chooses the tool whose name and description best overlap each request. Then the benchmark applies each declared mutation, one at a time, and reports how far selection dropped, which smells the edit introduced, and how many tokens it removed. That is the mutation-backed selection impact: proof that a description change improves or degrades selection, not just its style score. It is deterministic and model-free, so it runs in CI offline.

The selection oracle is a deliberate proxy, a reproducible lower bound, not a claim about a specific model. It reacts to the same lexical signal a retriever or a model's tool router keys on, so degrading the keywords that matched an intent measurably moves the pick.

The benchmark reports selection accuracy with a 95% Wilson confidence interval (with one pick per intent, accuracy equals selection F1), TDQS mean and min per-tool scores, the DESC-NNN lint findings grouped into named smells (underspecified, verbose, uninformative, brittle, missing_examples, missing_return_format, missing_annotations, invalid_annotation), and a whitespace-token estimate of the catalog cost.
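For reference, the Wilson score interval is a closed-form calculation. A minimal sketch of the textbook formula at z = 1.96, without continuity correction; whether tool-bench applies a correction is not stated here, so treat that as an assumption.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a proportion; z = 1.959964 gives about 95%."""
    if total == 0:
        return (0.0, 1.0)  # no intents: the interval is uninformative
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))
```

Five correct picks out of ten intents gives roughly (0.24, 0.76), and ten out of ten gives a lower edge near 0.72 rather than a degenerate zero-width interval, which is the behavior that makes Wilson a sensible choice for small intent sets.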

The spec.

catalog:                      # inline tools/list, or use catalog_path:
  tools:
    - name: lookup_a
      description: Return the current weather forecast for a city, as JSON.
intents:
  - query: what is the weather forecast in Paris
    gold: [lookup_a]
mutations:
  - tool: lookup_a
    kind: blank               # blank | genericize | truncate | drop_keyword

catalog is an inline tools/list result (an object with a tools array, or a bare array of tools); use catalog_path: to point at a saved snapshot. Each mutation names a tool and a kind:

`kind`	Parameters	Effect
`blank`		Remove the description entirely.
`genericize`		Replace it with a contentless phrase.
`truncate`	`chars`	Keep only the first `chars` characters.
`drop_keyword`	`word`	Remove every occurrence of a content word.
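The select-then-mutate loop can be sketched with a toy lexical selector. Everything here is illustrative: the tokenizer, the overlap score, the tie-break on tool name, and the "A tool." placeholder used for genericize are assumptions, not the benchmark's actual selector or phrases.

```python
import re
from typing import Dict, List, Set


def tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select(query: str, tools: List[Dict[str, str]]) -> str:
    """Pick the tool whose name and description overlap the query most."""
    q = tokens(query)
    best = min(
        tools,
        # Highest overlap first; ties break on name so the pick is deterministic.
        key=lambda t: (-len(q & tokens(t["name"] + " " + t.get("description", ""))), t["name"]),
    )
    return best["name"]


def mutate(tools: List[Dict[str, str]], tool: str, kind: str,
           chars: int = 0, word: str = "") -> List[Dict[str, str]]:
    """Return a copy of the catalog with one description mutated."""
    out = []
    for t in tools:
        t = dict(t)
        if t["name"] == tool:
            desc = t.get("description", "")
            if kind == "blank":
                desc = ""
            elif kind == "genericize":
                desc = "A tool."  # placeholder phrase, assumed
            elif kind == "truncate":
                desc = desc[:chars]
            elif kind == "drop_keyword":
                desc = re.sub(rf"\b{re.escape(word)}\b", "", desc, flags=re.IGNORECASE)
            else:
                raise ValueError(f"unknown mutation kind: {kind}")
            t["description"] = desc
        out.append(t)
    return out


def accuracy(intents: List[Dict[str, object]], tools: List[Dict[str, str]]) -> float:
    return sum(select(i["query"], tools) in i["gold"] for i in intents) / len(intents)
```

The selection drop for a mutation is then `accuracy(intents, tools) - accuracy(intents, mutate(tools, ...))`. A catalog needs at least one competing tool for a drop to appear: with a single tool, the selector picks it no matter how badly its description is degraded.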

How to read it and run it.

mcptest tool-bench run <spec> [--min-drop FRACTION] [--json]

--min-drop turns the run into a regression gate: it exits non-zero when the largest selection drop across mutations is below the threshold, i.e. the descriptions are not actually carrying the selection signal. --json emits the structured report.

The smell categories are also exposed by the tool_quality: check block, so a mcptest run suite can gate on them directly:

tool_quality:
  - name: catalog smells
    server: my-server
    expect:
      - target: smells.underspecified
        matcher: { schema: { maximum: 0 } }
      - target: smell_total
        matcher: { schema: { maximum: 2 } }

See the runnable tool-description-bench example.

Related

Reliability and trace metrics: trajectory validation, golden path, stability, fault recovery, narrative-vs-trace.
Tool-edge coverage: the allow/deny policy gate.
Evaluation: judges, juries, and rubrics: the model-graded surface and the full scoring-method map.
mcptest mock: the deterministic mock server the examples target.
