Tool-surface token efficiency

A large tool surface is not free. Every tool you advertise (its name, its description, its input schema) costs input tokens on every model call, and a wide catalog can drown the model in near-duplicate distractors. So "did the agent pick the right tool" (the selection F1) is only half the question. The other half is "how much did each correct selection cost" in tokens and dollars. A fat catalog that scores a fine F1 while burning tokens is inefficient, and the F1 alone hides that.

The token_efficiency: gate folds three numbers into one report: the selection F1, the serialized tool-surface token count, and the run cost. It exposes the catalog's price per correct selection as assertable targets, so a continuous integration suite can hold the line on both accuracy and efficiency at once.

What it measures

The gate scores the same equal-function selection F1 as the equal_function_sets: gate. You declare named classes of interchangeable tools, and a tool choice is correct when it lands in the right class. That part is deterministic and needs no model in the loop. On top of that, the gate folds in two cost signals:

Tool-surface tokens. The merged tool catalog the model actually saw is listed from the connected servers and tokenized the same way mcptest doctor counts a catalog (each tool's name, description, and inputSchema JSON, with the cl100k_base BPE tokenizer). This is the per-call input-token price of the surface.
Run cost. The dollars the run spent, summed across the cell's runs (real provider runs only; a free or key-free run reports no cost).
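
As a rough illustration of the tool-surface count, the sketch below sums token counts over each tool's name, description, and inputSchema JSON. The helper name `surface_tokens`, the choice to tokenize the three fields separately, and the sample catalog are all assumptions made for illustration; the exact serialization mcptest uses may differ. The encoder is passed in so the example runs without extra dependencies. With the tiktoken package installed, `tiktoken.get_encoding("cl100k_base").encode` would be the matching BPE encoder.

```python
import json
from typing import Callable, Iterable, Sequence


def surface_tokens(tools: Iterable[dict], encode: Callable[[str], Sequence]) -> int:
    """Sum the token counts of every tool's name, description, and inputSchema.

    Assumption: the three fields are tokenized separately and the counts added.
    """
    total = 0
    for tool in tools:
        # Serialize the schema to JSON text before tokenizing it.
        schema_json = json.dumps(tool.get("inputSchema", {}), sort_keys=True)
        for text in (tool["name"], tool.get("description", ""), schema_json):
            total += len(encode(text))
    return total


# Hypothetical two-tool catalog, shaped like a tools listing.
catalog = [
    {
        "name": "search",
        "description": "Search documents by keyword.",
        "inputSchema": {"type": "object", "properties": {"query": {"type": "string"}}},
    },
    {
        "name": "fetch",
        "description": "Fetch a document by id.",
        "inputSchema": {"type": "object", "properties": {"id": {"type": "string"}}},
    },
]

# Stand-in encoder (whitespace split) so the sketch has no dependencies.
# It undercounts compared with a BPE tokenizer; swap in the cl100k_base
# encoder from tiktoken for a realistic figure.
whitespace_encode = str.split

print(surface_tokens(catalog, whitespace_encode))
```

Because the count is paid on every model call, each tool added to the catalog raises the total even if the agent never selects it.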

From those it derives the assertable token_efficiency.* targets:

| Target | Meaning |
| --- | --- |
| `token_efficiency.f1` | Selection F1, integer percent `0..=100`. |
| `token_efficiency.tool_surface_tokens` | Tokens describing the tool surface. |
| `token_efficiency.correct_selections` | True positives pooled across runs. |
| `token_efficiency.cost` | Dollars spent on the run. |
| `token_efficiency.tokens_per_correct` | Tool-surface tokens per correct selection. |
| `token_efficiency.cost_per_correct` | Dollars per correct selection. |

A letter grade rides alongside, keyed on the F1: A at 90 and above, B at 80, C at 70, D at 60, and F below 60. The grade is a human-readable signal printed with the result. Because it is a letter rather than a number, gate on token_efficiency.f1 instead.
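
The grade bands can be sketched as a small lookup. The function name `letter_grade` is hypothetical, and treating each threshold as an inclusive lower bound is an assumption drawn from the wording "A at 90".

```python
def letter_grade(f1_percent: int) -> str:
    """Map an integer-percent F1 (0 to 100) to a letter grade."""
    # Thresholds come from the text; inclusive lower bounds are an assumption.
    for floor, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if f1_percent >= floor:
            return letter
    return "F"


for score in (95, 80, 59):
    print(score, letter_grade(score))
```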

The per-correct figures are undefined when nothing was selected correctly (no true positives), and cost_per_correct is undefined on a free run. An undefined figure resolves to absent rather than dividing by zero, so a reporter omits it.
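
A minimal sketch of how the per-correct figures could be derived, assuming each is a plain ratio over the pooled true positives. The helper `per_correct` and the sample numbers are hypothetical; `None` stands in for an absent figure.

```python
from typing import Optional


def per_correct(amount: Optional[float], correct_selections: int) -> Optional[float]:
    """Divide an amount by the number of correct selections.

    Returns None (absent) when there were no correct selections, or when the
    amount itself is missing, as cost is on a free run.
    """
    if amount is None or correct_selections == 0:
        return None
    return amount / correct_selections


# Hypothetical run: a 6000-token surface, 4 true positives, $0.06 spent.
tokens_per_correct = per_correct(6000, 4)  # 1500.0
cost_per_correct = per_correct(0.06, 4)    # 0.015

# Free run: no cost reported, so the figure is absent.
free_cost_per_correct = per_correct(None, 4)

# No true positives: both figures are absent instead of dividing by zero.
no_hits = per_correct(6000, 0)

print(tokens_per_correct, cost_per_correct, free_cost_per_correct, no_hits)
```

Note that a more accurate agent lowers tokens_per_correct on the same catalog, which is why the figure is read alongside the F1 and not in place of it.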

Declaring the gate

The block has the same shape as the equal_function_sets: block: a required, non-empty classes: list and an optional expect: list of assertions over the token_efficiency.* targets. An empty or omitted expect: applies the default gate, token_efficiency.f1 >= 50.

```yaml
agents:
  - name: search stays efficient on a wide catalog
    model: claude-sonnet-4-5
    servers: [search-server]
    runs: 3
    prompt: >
      Find the document about quarterly revenue and read it back to me.
    token_efficiency:
      classes:
        - name: search
          members:
            - search
        - name: fetch
          members:
            - fetch
      expect:
        # Selection must not collapse on the padded catalog.
        - target: token_efficiency.f1
          matcher:
            schema: { minimum: 60 }
        # Each correct selection costs no more than 1500 tool-surface tokens.
        - target: token_efficiency.tokens_per_correct
          matcher:
            schema: { maximum: 1500 }
```

A runnable version of this suite is in examples/token-efficiency/.

Live-run only

The gate needs the live tool surface to count its tokens, so it only fires on a live run. A cassette replay carries no catalog token count (the recording stores only a fingerprint of the tool set, not the tools themselves), so the gate does not fire on replay and the other gates score the replayed trace as usual. Run the suite live to exercise the efficiency targets.

When to reach for it

Reach for token_efficiency: when you are tuning a tool catalog and want to keep accuracy from masking bloat. Trimming verbose descriptions, removing near-duplicate tools, or splitting a monolithic server should move tokens_per_correct down without moving the F1. Gate on tokens_per_correct (or cost_per_correct when you run against a priced provider) to catch a catalog that grows wider over time while scoring the same.