Tool-selection F1 via equal-function sets

This gate scores whether an agent picked the right tool in a multi-server scenario, with no model in the loop. It works by grouping tools that do the same job into named classes, then reading precision, recall, and F1 straight off the recorded tool-call trace. The same trace always yields the same numbers, so the gate is deterministic, free to run, and safe to put in continuous integration.

Equal-function sets

An equal-function set is a named group of tools that accomplish the same job. Any member of the set is an acceptable choice. If a scenario can be solved by calling a web-search tool, and three different servers each expose a web-search tool, those three tools form one equal-function set. Selecting any one of them counts as a correct selection for that capability.

This is the idea from MSC-Bench (arXiv:2510.19423). Grouping interchangeable tools into a class lets a tool choice be judged correct when it lands in the right class, rather than when it matches one exact tool id. That matters in a multi-server setup where several servers offer the same capability under different names.

Because the check is "did an observed call land in the right class", it can be computed by reading the trace alone. There is no judge model, no token cost, and no cross-provider drift. Contrast this with a judged metric, where a model reads the agent's free-text answer and returns a verdict. A judged metric is not byte-stable (the same answer can score differently across runs or providers) and it costs tokens on every evaluation. The equal-function-set gate trades the ability to grade free-text answers for a number you can gate on in continuous integration.

Precision, recall, and F1

The three metrics are defined as follows. A true positive is a class the agent satisfied. A false positive is a tool call that satisfied no class. A false negative is a class the agent never satisfied.

Precision is true positives divided by the sum of true positives and false positives. It answers: of the tools the agent chose, how many were on target.
Recall is true positives divided by the sum of true positives and false negatives. It answers: of the capabilities the scenario required, how many the agent reached.
F1 is the harmonic mean of precision and recall, which is two times precision times recall, divided by the sum of precision and recall. It is high only when both precision and recall are high.
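
Written in terms of the true-positive, false-positive, and false-negative counts (abbreviated here as TP, FP, and FN), the three definitions above read as follows. The last equality is the standard algebraic simplification of the harmonic mean, not an extra rule.

```latex
\[
P = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FP}}, \qquad
R = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FN}}, \qquad
F_1 = \frac{2PR}{P + R} = \frac{2\,\mathit{TP}}{2\,\mathit{TP} + \mathit{FP} + \mathit{FN}}
\]
```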

The exact counting rule

The implementation walks the observed tool calls against the declared classes. Each class is consumed at most once, which leaves four possible outcomes:

A class is a true positive the first time an observed call names a member of that class.
A class with no observed member is a false negative.
An observed call that names no member of any declared class is a false positive.
Repeated interchangeable members neither inflate nor penalize the score. Calling two members of the same class still counts as one true positive, and the second call is neither a true positive nor a false positive.
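
The counting rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the gate's actual source. The function name `count_outcomes` and its argument shapes are hypothetical, ids are compared as fully qualified strings only, and the handling of a tool listed in more than one class (first unmatched class in declaration order) is a guess that the rule does not spell out.

```python
def count_outcomes(classes, observed):
    """Return (tp, fp, fn) for one run.

    classes:  dict mapping class name -> set of fully qualified member ids
    observed: list of fully qualified "server.tool" ids, in call order
    """
    matched = set()  # names of classes that have already been consumed
    fp = 0
    for call in observed:
        # Every declared class that lists this call as a member.
        hits = [name for name, members in classes.items() if call in members]
        if not hits:
            fp += 1  # the call names no member of any class
            continue
        # Consume the first class that is still unmatched (assumption for
        # overlapping classes). If every hit is already matched, the call is
        # a repeat and counts as neither a true nor a false positive.
        for name in hits:
            if name not in matched:
                matched.add(name)
                break
    tp = len(matched)
    fn = len(classes) - tp  # classes with no observed member
    return tp, fp, fn


classes = {
    "search": {"brave.web_search", "google.search"},
    "fetch": {"http.get"},
}
# Two members of the same class: one true positive, no false positive.
print(count_outcomes(classes, ["brave.web_search", "google.search"]))  # (1, 0, 1)
```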

A member id with a server prefix (server.tool) matches only the fully qualified id, so two servers that expose an identically named tool stay distinct. A bare tool id (no server prefix) matches that tool on any server, which is convenient for a single-server scenario.
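
One way to picture the qualified-versus-bare distinction is the helper below. It is a sketch only. `member_matches` is a hypothetical name, and it assumes that a bare id is any member string without a dot and that tool names themselves contain no dots. The real matcher may tell the two forms apart differently.

```python
def member_matches(member: str, server: str, tool: str) -> bool:
    """Return True if a declared member id matches an observed call."""
    if "." in member:
        # Qualified member: must equal the full "server.tool" id.
        return member == f"{server}.{tool}"
    # Bare member: matches the tool name on any server.
    return member == tool


print(member_matches("brave.web_search", "brave", "web_search"))   # True
print(member_matches("brave.web_search", "google", "web_search"))  # False
print(member_matches("web_search", "google", "web_search"))        # True
```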

Percentages and edge cases

All three metrics are reported as integers from 0 to 100. Integer percents keep the output identical on every platform (no floating-point formatting drift) and match the floor convention the other agent gates use. The edge cases are pinned so the result is byte-stable:

No declared classes and no observed calls scores 100 on every metric. Nothing was expected and nothing was wrongly chosen, so the selection is vacuously perfect.
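In every other case, a zero denominator yields 0 for that metric rather than a not-a-number value. For example, a run with declared classes but no observed calls has no true or false positives, so its precision is 0.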
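
The pinned edge cases can be expressed as a small helper. This sketch makes two assumptions that the text does not confirm: percentages are rounded down with integer division, and F1 is derived from the raw counts rather than from the already-rounded precision and recall. The names `percent` and `selection_metrics` are illustrative.

```python
def percent(numerator: int, denominator: int) -> int:
    """Integer percent, with a zero denominator pinned to 0."""
    if denominator == 0:
        return 0
    return (100 * numerator) // denominator  # assumption: round down


def selection_metrics(tp: int, fp: int, fn: int) -> dict:
    # All three counts are zero only when no class was declared and no call
    # was observed: the vacuously perfect case.
    if tp == 0 and fp == 0 and fn == 0:
        return {"precision": 100, "recall": 100, "f1": 100}
    return {
        "precision": percent(tp, tp + fp),
        "recall": percent(tp, tp + fn),
        "f1": percent(2 * tp, 2 * tp + fp + fn),
    }


print(selection_metrics(0, 0, 0))  # {'precision': 100, 'recall': 100, 'f1': 100}
print(selection_metrics(0, 0, 2))  # {'precision': 0, 'recall': 0, 'f1': 0}
```

Because the arithmetic is done on integers only, no floating-point value is ever formatted.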

A worked example

Declare the gate inside an agent test with an equal_function_sets: block. The block has a required classes: list and an optional expect: list. Each class has a name: and a members: list of server.tool ids.

tests:
  - name: research-agent picks search then fetch
    type: agent
    agent: researcher
    runs: 1
    equal_function_sets:
      classes:
        - name: search
          members:
            - brave.web_search
            - google.search
        - name: fetch
          members:
            - http.get
      expect:
        - tool_selection.f1: { ">=": 80 }

This scenario has two capabilities. The search class is satisfied by either brave.web_search or google.search. The fetch class is satisfied by http.get.

Suppose the run produces this tool-call trace. The scorer reads the same tool_calls: [{name, server}] shape the agent driver records, and builds a server.tool id from each entry.

{
  "tool_calls": [
    { "name": "web_search", "server": "brave" },
    { "name": "get", "server": "http" }
  ]
}
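
Building the server.tool ids from a trace of this shape might look like the following. The function name `observed_ids` is hypothetical, and the sketch assumes every entry carries both a name and a server field, as in the trace above.

```python
import json


def observed_ids(trace_json: str) -> list[str]:
    """Turn a recorded trace into a list of "server.tool" ids, in call order."""
    trace = json.loads(trace_json)
    return [f"{call['server']}.{call['name']}" for call in trace["tool_calls"]]


trace = """
{
  "tool_calls": [
    { "name": "web_search", "server": "brave" },
    { "name": "get", "server": "http" }
  ]
}
"""
print(observed_ids(trace))  # ['brave.web_search', 'http.get']
```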

The two observed ids are brave.web_search and http.get. Scoring them against the two classes gives:

Class	Matched by	Outcome
search	brave.web_search	true positive
fetch	http.get	true positive

There are 2 true positives, 0 false positives, and 0 false negatives. Precision is 2 divided by 2, which is 100. Recall is 2 divided by 2, which is 100. F1 is 100.

The tool_selection.f1: { ">=": 80 } floor passes.

Now suppose a second run misses the fetch capability and instead calls a tool that belongs to no class.

{
  "tool_calls": [
    { "name": "search", "server": "google" },
    { "name": "exec", "server": "shell" }
  ]
}

The observed ids are google.search and shell.exec. The search class is matched by google.search. The fetch class is never satisfied, so it is a false negative. The shell.exec call matches no class, so it is a false positive.

Class	Matched by	Outcome
search	google.search	true positive
fetch	(none)	false negative

Unexpected calls: shell.exec (false positive).

That gives 1 true positive, 1 false positive, and 1 false negative. Precision is 1 divided by 2, which is 50. Recall is 1 divided by 2, which is 50. F1 is 50. The report names fetch as the missed class and shell.exec as the unexpected tool, so the author sees exactly which capability the agent skipped and which extra tool it reached for.

The default gate

When the expect: list is omitted (or left empty), the block applies one default assertion: tool_selection.f1 >= 50. The reasoning is that a selection F1 under half is more wrong than right. The smallest useful form of the block is therefore just the classes: list:

equal_function_sets:
  classes:
    - name: search
      members: [brave.web_search, google.search]

That gates F1 at 50 with no further configuration.
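
The default can be pictured as a fallback applied before the assertions are checked. The sketch below is an assumption-laden illustration. `gate_passes` and `DEFAULT_EXPECT` are hypothetical names, and only the ">=" matcher that appears in this document is handled.

```python
# Hypothetical stand-in for the built-in default: tool_selection.f1 >= 50.
DEFAULT_EXPECT = [{"tool_selection.f1": {">=": 50}}]


def gate_passes(metrics: dict, expect=None) -> bool:
    """Check integer-percent metrics against a list of expect assertions."""
    assertions = expect or DEFAULT_EXPECT  # omitted or empty list -> default
    for assertion in assertions:
        for target, matcher in assertion.items():
            for op, floor in matcher.items():
                if op != ">=":
                    raise ValueError(f"matcher not covered by this sketch: {op}")
                if metrics[target] < floor:
                    return False
    return True


print(gate_passes({"tool_selection.f1": 50}))  # True
print(gate_passes({"tool_selection.f1": 49}))  # False
```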

Setting a precision, recall, or F1 floor

To gate a different metric, or to set a stricter floor, write an explicit expect: list. The three assertable targets are tool_selection.f1, tool_selection.precision, and tool_selection.recall. Each is an integer percent from 0 to 100, and each is gated with the standard matchers. To require high precision (few wrong tool calls) and a moderate F1:

equal_function_sets:
  classes:
    - name: search
      members: [brave.web_search, google.search]
    - name: fetch
      members: [http.get]
  expect:
    - tool_selection.precision: { ">=": 90 }
    - tool_selection.f1: { ">=": 70 }

When a test runs more than once (a runs: value above 1), the gate micro-averages across runs. It sums the true positives, false positives, and false negatives over every run, then derives the three percentages from those totals. The sums are order-independent, so a fixed set of run traces always produces the same aggregate.
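
Micro-averaging as described above can be sketched like this. The function name `micro_average` is hypothetical. Rounding down and applying the all-zero edge case to the summed counts are assumptions carried over for illustration.

```python
def micro_average(run_counts):
    """Aggregate per-run (tp, fp, fn) tuples into (precision, recall, f1)."""
    tp = sum(counts[0] for counts in run_counts)
    fp = sum(counts[1] for counts in run_counts)
    fn = sum(counts[2] for counts in run_counts)

    def pct(numerator, denominator):
        # Integer percent, zero denominator pinned to 0 (assumption: round down).
        return 0 if denominator == 0 else (100 * numerator) // denominator

    if tp == 0 and fp == 0 and fn == 0:
        return (100, 100, 100)  # nothing expected, nothing chosen
    return (pct(tp, tp + fp), pct(tp, tp + fn), pct(2 * tp, 2 * tp + fp + fn))


# The two runs from the worked example: totals are 3 TP, 1 FP, 1 FN.
runs = [(2, 0, 0), (1, 1, 1)]
print(micro_average(runs))                  # (75, 75, 75)
print(micro_average(list(reversed(runs))))  # (75, 75, 75), order does not matter
```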

When to use this gate

Reach for the objective F1 gate when you want a deterministic, free, gateable number for tool selection in a multi-server scenario. It is the right tool for continuous integration, where the same trace must score the same way every time and you cannot afford a token cost on every run.

Reach for the judged path instead when correctness lives in the agent's free-text answer, which no trace can confirm. That path uses a model as a judge to grade the final answer against a rubric, and is documented in LLM-as-judge evaluations. It costs a model call per scored answer and is not perfectly repeatable, so it does not belong in a gate that must be deterministic. The two checks are complementary: the objective F1 gate asserts that the agent reached the right tools, and the judged matcher asserts that the prose it produced is good.