
Name-free discovery and orchestration diagnostics

Most agent tests name the tool the model should call, then check that it called it. That measures execution, not discovery. A name-free scenario removes the name: the prompt states only the user's intent, names no tool and no server, and the agent has to find the path itself. This page explains how to author an intent-only scenario, how the discovered path is scored against equal-function classes, and how the five orchestration diagnostics grade the run. Every check here is objective and runs no model, so the same recorded trace always yields the same numbers.

This is the discovery axis from MCP-Atlas (arXiv:2602.00933), which separates scenarios whose prompt names the tool from scenarios whose prompt states only the intent.

Declaring an intent-only scenario

Two blocks make a scenario name-free. The discovery: block declares the promise, and the equal_function_sets: block says what counts as reaching each capability.

The assertion targets the discovered path through the classes. Because no single tool is named, the gate sets a floor on tool_selection.recall (whether the run reached each declared capability) instead of asserting a specific tool id.
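To make the recall idea concrete, here is a minimal Python sketch of equal-function-set recall. It is an illustration, not mcptest's implementation. It assumes a class counts as reached when the trace contains at least one of its members, and that the score is the reached fraction rounded down to an integer percent. The function name, the trace shape (a flat list of server.tool ids), the rounding, and the error on an empty class list are all assumptions.

```python
from typing import Dict, List


def equal_function_recall(classes: Dict[str, List[str]], called: List[str]) -> int:
    """Integer percent of declared classes reached by at least one called member.

    Hypothetical helper: the rounding rule and the input shapes are assumptions.
    """
    if not classes:
        # Assumption: scoring with no declared classes is treated as an authoring error.
        raise ValueError("no equal-function classes declared")
    called_set = set(called)
    # A class is reached when any one of its interchangeable members was called.
    reached = sum(1 for members in classes.values() if called_set & set(members))
    return (100 * reached) // len(classes)


# Class names and members mirror the example suite on this page.
classes = {
    "search": ["catalog.search", "catalog.web_search"],
    "fetch": ["fulfillment.fetch", "fulfillment.get"],
}

# Either member of a class satisfies it, so this trace reaches both classes.
trace = ["catalog.web_search", "fulfillment.get"]
print(equal_function_recall(classes, trace))  # 100

# Calling both search tools still reaches only one of the two classes.
print(equal_function_recall(classes, ["catalog.search", "catalog.web_search"]))  # 50
```

Under these assumptions, a floor of 100 on tool_selection.recall passes only when every declared class was reached, regardless of which member the agent picked.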

The five orchestration diagnostics

A run can reach the right capability yet waste calls, call the right tool with malformed arguments, or fail once and never recover. Collapsing that into one number hides where the agent went wrong, so MCP-Atlas scores five independent axes. The orchestration: block folds the recorded tool-call trace (plus the declared equal-function classes, for the discovery axis) into five assertable targets. Each is an integer percent from 0 to 100, and each is computed deterministically with no model in the loop. The exact rule for each:

All five are objective. None consult a model, so a green gate stays reproducible and free.

Inline example suite

This is the name-free scenario from examples/name-free-discovery/. The prompt names no tool and no server. Two mock servers each expose two interchangeable tools, grouped into a search class and a fetch class. Both servers are run by mcptest mock from static manifests, so the server side of the run is deterministic and offline (see mcptest mock for the manifest format).

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json
servers:
  catalog:
    command: ["mcptest", "mock", "--tools-from", "./servers/catalog.yml"]
  fulfillment:
    command: ["mcptest", "mock", "--tools-from", "./servers/fulfillment.yml"]

agents:
  - name: intent-only discovery across two servers
    model: claude-sonnet-4-5
    servers: [catalog, fulfillment]
    runs: 5
    prompt: >
      Find the current population of Paris and pull the source document for it.
    discovery:
      name_free: true
    equal_function_sets:
      classes:
        - name: search
          members:
            - catalog.search
            - catalog.web_search
        - name: fetch
          members:
            - fulfillment.fetch
            - fulfillment.get
      expect:
        - target: tool_selection.recall
          matcher:
            schema: { minimum: 100 }
    orchestration:
      expect:
        - target: orchestration.discovery
          matcher:
            schema: { minimum: 100 }
        - target: orchestration.syntax
          matcher:
            schema: { minimum: 100 }
        - target: orchestration.error_recovery
          matcher:
            schema: { minimum: 100 }
        - target: orchestration.efficiency
          matcher:
            schema: { minimum: 50 }

The referenced manifest servers/catalog.yml is a mock_server block:

mock_server:
  name: catalog
  tools:
    - name: search
      description: Search the catalog for matching records.
      input_schema:
        type: object
        required: [query]
        properties:
          query: { type: string }
      response:
        content:
          - type: text
            text: "Top result for ${args.query}: record-42."
    - name: web_search
      description: Search the public web for matching pages.
      input_schema:
        type: object
        required: [query]
        properties:
          query: { type: string }
      response:
        content:
          - type: text
            text: "Top web result for ${args.query}: https://example.com/record-42."
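The response text in the manifest contains a ${args.query} placeholder. The sketch below shows one plausible reading of that syntax: each ${args.NAME} is replaced by the matching call argument. This is an assumption about the template semantics, not a description of how mcptest mock resolves placeholders. The function name and the choice to raise on a missing argument are hypothetical.

```python
import re
from typing import Any, Mapping

# Matches placeholders of the form ${args.NAME}, as seen in the manifest above.
_PLACEHOLDER = re.compile(r"\$\{args\.([A-Za-z_][A-Za-z0-9_]*)\}")


def render_response(template: str, args: Mapping[str, Any]) -> str:
    """Substitute ${args.NAME} placeholders with values from the tool-call arguments.

    Hypothetical helper: the real mock server may handle missing or nested
    arguments differently.
    """
    def substitute(match):
        name = match.group(1)
        if name not in args:
            # Assumption: a placeholder with no matching argument is an error.
            raise KeyError(f"missing argument: {name}")
        return str(args[name])

    return _PLACEHOLDER.sub(substitute, template)


text = render_response(
    "Top result for ${args.query}: record-42.",
    {"query": "population of Paris"},
)
print(text)  # Top result for population of Paris: record-42.
```

Because the output depends only on the template and the arguments, the same call always produces the same response, which is what keeps a static manifest deterministic.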

Each expect: item is the standard assertion shape: a target: paired with a matcher:. A floor on a percent is matcher: { schema: { minimum: N } }.
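As an illustration of what a floor means, the following Python sketch evaluates one expect item against a map of computed scores. It assumes the schema: matcher follows JSON Schema semantics, where minimum is inclusive, so a score equal to the floor passes. The check_floor function, the shape of the scores map, and the score values are hypothetical and not taken from mcptest.

```python
from typing import Any, Mapping


def check_floor(scores: Mapping[str, int], expectation: Mapping[str, Any]) -> bool:
    """Return True when the targeted score meets the declared minimum.

    Hypothetical helper: only the `minimum` keyword is handled here.
    """
    target = expectation["target"]
    minimum = expectation["matcher"]["schema"]["minimum"]
    # Assumption: a target with no computed score fails the gate.
    if target not in scores:
        return False
    # JSON Schema's `minimum` is inclusive: a value equal to it is valid.
    return scores[target] >= minimum


# Illustrative scores, not output from a real run.
scores = {"orchestration.efficiency": 50, "orchestration.syntax": 99}

efficiency_gate = {
    "target": "orchestration.efficiency",
    "matcher": {"schema": {"minimum": 50}},
}
syntax_gate = {
    "target": "orchestration.syntax",
    "matcher": {"schema": {"minimum": 100}},
}

print(check_floor(scores, efficiency_gate))  # True
print(check_floor(scores, syntax_gate))      # False
```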

Objective, no model

Both the discovery score (equal-function-set recall) and all five orchestration diagnostics are computed from the recorded trace and the declared classes. None of them call a model, so they are free to run and byte-stable across platforms and providers. The scoring math is exercised offline by the core test:

cargo test -p mcptest-core --test orchestration_offline

The discovery: and orchestration: blocks are marked preview in the schema: the diagnostics engine computes the sub-scores today, and the runner wiring that emits them per scenario is still landing.