Name-free discovery and orchestration diagnostics
Most agent tests name the tool the model should call, then check that it called it. That measures execution, not discovery. A name-free scenario removes the name: the prompt states only the user's intent, names no tool and no server, and the agent has to find the path itself. This page explains how to author an intent-only scenario, how the discovered path is scored against equal-function classes, and how the five orchestration diagnostics grade the run. Every check here is objective and runs no model, so the same recorded trace always yields the same numbers.
This is the discovery axis from MCP-Atlas (arXiv:2602.00933), which separates scenarios whose prompt names the tool from scenarios whose prompt states only the intent.
Declaring an intent-only scenario
Two blocks make a scenario name-free. The discovery: block declares the promise, and the equal_function_sets: block says what counts as reaching each capability.
- discovery.name_free: true is the author's declaration that the prompt names no tools and no servers. It is a promise about the prompt, not a transform: it tells a reader (and a future linter) that the scenario forces genuine discovery, and it signals that the discovered tool path is to be judged against equal-function classes rather than against a named expected tool. An absent name_free: defaults to false.
- equal_function_sets: groups the interchangeable tools into named classes, so any member of a class counts as reaching that capability. This is what makes scoring possible without naming a single expected tool: the discovered path is correct when it lands in the right class. See Tool-selection F1 via equal-function sets for the precision, recall, and F1 rules.
The assertion targets the discovered path through the classes. Because no single tool is named, the gate floors tool_selection.recall (did the run reach each declared capability) rather than asserting a specific tool id.
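The recall floor can be pictured with a minimal Python sketch. It is not the mcptest implementation: the function name class_recall, the server.tool id strings, the round-half-up rule, and the score of 0 when no classes are declared are all assumptions made for illustration.

```python
def class_recall(classes: dict[str, set[str]], called: list[str]) -> int:
    """Percent of declared classes reached by at least one called tool id."""
    if not classes:
        # Assumption: the source does not define recall with no declared classes.
        return 0
    # A class is reached when any called tool id is one of its members.
    hit = sum(
        1 for members in classes.values()
        if any(tool in members for tool in called)
    )
    n = len(classes)
    # Integer round-half-up of 100 * hit / n (rounding mode is an assumption).
    return (200 * hit + n) // (2 * n)


classes = {
    "search": {"catalog.search", "catalog.web_search"},
    "fetch": {"fulfillment.fetch", "fulfillment.get"},
}
print(class_recall(classes, ["catalog.web_search"]))
print(class_recall(classes, ["catalog.web_search", "fulfillment.get"]))
```

Either member of a class satisfies it, so a run that picks catalog.web_search over catalog.search is not penalized. Only a class with no call at all lowers the score.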
The five orchestration diagnostics
A run can reach the right capability yet waste calls, call the right tool with malformed arguments, or fail once and never recover. Collapsing that into one number hides where the agent went wrong, so MCP-Atlas scores five independent axes. The orchestration: block folds the recorded tool-call trace (plus the declared equal-function classes, for the discovery axis) into five assertable targets. Each is an integer percent from 0 to 100, and each is computed deterministically with no model in the loop. The exact rule for each:
- orchestration.discovery: did the agent reach the right capability class at all. This reuses the equal-function-set recall: of the declared expected classes, the percent that at least one trace call hit. A name-free scenario scores its discovered path on this axis. With two declared classes, reaching one of them scores 50; reaching both scores 100.
- orchestration.parameterization: of all tool calls in the trace, the percent whose args are a non-empty JSON object. Argument schemas are not available offline, so this measures argument presence, not schema validity. A call with no args object, or an empty {}, counts against this axis. An empty trace scores 100 (there is no under-parameterized call).
- orchestration.syntax: of all tool calls in the trace, the percent that are structurally well formed: a non-empty tool-name string paired with JSON object args (an object, possibly empty). A call with a blank name or non-object args is malformed and counts against this axis. An empty trace scores 100.
- orchestration.error_recovery: of the calls that returned an error in the trace, the percent that are followed later in the trace by a successful call to the same tool or to an equivalent tool in the same declared class. When the trace contains no errors, this axis is 100 (there is nothing to recover from).
- orchestration.efficiency: the ratio of the optimal path length (the number of declared expected classes) to the observed call count, as a percent capped at 100. Calling exactly one tool per expected class scores 100; extra or wasted calls lower it. With no expected classes declared, this axis is 0. Two classes optimal over three observed calls rounds to 67.
All five are objective. None consult a model, so a green gate stays reproducible and free.
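The five rules can be read as one pure function over the trace. The Python sketch below is illustrative only, not the mcptest engine. The Call record, the percent helper, and the round-half-up rule are hypothetical. Three behaviors are assumptions where the rules above are silent: an absent args value counts as non-object, discovery is 0 with no declared classes, and efficiency is 0 for an empty trace.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Call:
    tool: str            # tool id such as "catalog.search" (hypothetical trace shape)
    args: Any = None     # parsed JSON arguments; None when the call carried none
    error: bool = False  # True when the call returned an error


def percent(num: int, den: int) -> int:
    # Integer round-half-up, capped at 100 (rounding mode is an assumption).
    return min(100, (200 * num + den) // (2 * den))


def diagnostics(classes: dict[str, set[str]], trace: list[Call]) -> dict[str, int]:
    def equivalent(a: str, b: str) -> bool:
        # Same tool, or two members of one declared class.
        return a == b or any(a in m and b in m for m in classes.values())

    n = len(trace)
    called = {c.tool for c in trace if isinstance(c.tool, str)}

    # discovery: share of declared classes hit by at least one call.
    hit = sum(1 for members in classes.values() if called & members)
    discovery = percent(hit, len(classes)) if classes else 0  # assumption when empty

    # parameterization: args present as a non-empty JSON object.
    with_args = sum(1 for c in trace if isinstance(c.args, dict) and c.args)

    # syntax: non-blank tool name plus object args (possibly empty).
    well_formed = sum(
        1 for c in trace
        if isinstance(c.tool, str) and c.tool.strip() and isinstance(c.args, dict)
    )

    # error_recovery: each errored call needs a later successful equivalent call.
    errors = [i for i, c in enumerate(trace) if c.error]
    recovered = sum(
        1 for i in errors
        if any(not later.error and equivalent(trace[i].tool, later.tool)
               for later in trace[i + 1:])
    )

    # efficiency: declared classes over observed calls, capped at 100.
    # Assumption: an empty trace scores 0 here; the rules do not cover it.
    efficiency = percent(len(classes), n) if classes and n else 0

    return {
        "orchestration.discovery": discovery,
        "orchestration.parameterization": percent(with_args, n) if n else 100,
        "orchestration.syntax": percent(well_formed, n) if n else 100,
        "orchestration.error_recovery": percent(recovered, len(errors)) if errors else 100,
        "orchestration.efficiency": efficiency,
    }
```

For a three-call trace in which catalog.search errors, catalog.web_search then succeeds, and fulfillment.get is called with an empty {}, this sketch reports discovery 100, parameterization 67, syntax 100, error_recovery 100, and efficiency 67. The error is recovered because the later successful call is in the same class.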
Inline example suite
This is the name-free scenario from examples/name-free-discovery/. The prompt names no tool and no server. Two mock servers each expose two interchangeable tools, grouped into a search class and a fetch class. The servers are served by mcptest mock from static manifests, so the run is deterministic and offline (see mcptest mock for the manifest format).
# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json
servers:
  catalog:
    command: ["mcptest", "mock", "--tools-from", "./servers/catalog.yml"]
  fulfillment:
    command: ["mcptest", "mock", "--tools-from", "./servers/fulfillment.yml"]
agents:
  - name: intent-only discovery across two servers
    model: claude-sonnet-4-5
    servers: [catalog, fulfillment]
    runs: 5
    prompt: >
      Find the current population of Paris and pull the source document for it.
    discovery:
      name_free: true
    equal_function_sets:
      classes:
        - name: search
          members:
            - catalog.search
            - catalog.web_search
        - name: fetch
          members:
            - fulfillment.fetch
            - fulfillment.get
      expect:
        - target: tool_selection.recall
          matcher:
            schema: { minimum: 100 }
    orchestration:
      expect:
        - target: orchestration.discovery
          matcher:
            schema: { minimum: 100 }
        - target: orchestration.syntax
          matcher:
            schema: { minimum: 100 }
        - target: orchestration.error_recovery
          matcher:
            schema: { minimum: 100 }
        - target: orchestration.efficiency
          matcher:
            schema: { minimum: 50 }
The referenced manifest servers/catalog.yml is a mock_server block:
mock_server:
  name: catalog
  tools:
    - name: search
      description: Search the catalog for matching records.
      input_schema:
        type: object
        required: [query]
        properties:
          query: { type: string }
      response:
        content:
          - type: text
            text: "Top result for ${args.query}: record-42."
    - name: web_search
      description: Search the public web for matching pages.
      input_schema:
        type: object
        required: [query]
        properties:
          query: { type: string }
      response:
        content:
          - type: text
            text: "Top web result for ${args.query}: https://example.com/record-42."
Each expect: item is the standard assertion shape: a target: paired with a matcher:. A floor on a percent is matcher: { schema: { minimum: N } }.
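As a rough illustration of how such floors gate a run, the Python sketch below checks a dictionary of scores against a list of expect items. The function name failing_targets and the plain-dict representation are hypothetical, and the sketch handles only the minimum keyword. In JSON Schema, minimum is inclusive, so a score equal to the floor passes.

```python
def failing_targets(scores: dict[str, int], expect: list[dict]) -> list[str]:
    """Return the targets whose score falls below the declared floor."""
    failures = []
    for item in expect:
        # Only the `minimum` keyword is handled in this sketch.
        floor = item["matcher"]["schema"]["minimum"]
        if scores[item["target"]] < floor:  # inclusive floor: equal passes
            failures.append(item["target"])
    return failures


expect = [
    {"target": "orchestration.discovery", "matcher": {"schema": {"minimum": 100}}},
    {"target": "orchestration.efficiency", "matcher": {"schema": {"minimum": 50}}},
]
scores = {"orchestration.discovery": 100, "orchestration.efficiency": 40}
print(failing_targets(scores, expect))
```

Here the efficiency score of 40 is below its floor of 50, so that target is the only one reported as failing.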
Objective, no model
Both the discovery score (equal-function-set recall) and all five orchestration diagnostics are computed from the recorded trace and the declared classes. None of them call a model, so they are free to run and byte-stable across platforms and providers. The scoring math is exercised offline by the core test:
cargo test -p mcptest-core --test orchestration_offline
The discovery: and orchestration: blocks are marked preview in the schema: the diagnostics engine computes the sub-scores today, and the runner wiring that emits them per scenario is still landing.
Related
- Tool-selection F1 via equal-function sets: the precision, recall, and F1 rules the discovery axis reuses.
- mcptest mock: the deterministic mock server the example targets.
- Narrative-vs-trace divergence: a sibling objective agent gate.