
Compositions

A composition is a declared DAG of tool calls that runs in topological order. The runner walks the graph, dispatches each node when its parents have finished, and emits a combined output plus a per-node trace. Compositions are the deterministic complement to agents:, in which the LLM chooses tool order at run time. They also generalize a linear pipeline, in which each step always feeds the next, to an arbitrary DAG.
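As a rough illustration of "runs in topological order", the sketch below derives an execution order from needs: edges with Python's standard graphlib module. The node list and the execution_order helper are hypothetical; this is not mcptest's own scheduler, only a model of the ordering rule.

```python
from graphlib import TopologicalSorter

# Hypothetical node list mirroring the YAML shape: an id plus optional needs.
nodes = [
    {"id": "readme", "needs": ["search"]},
    {"id": "search"},
    {"id": "stars", "needs": ["search"]},
    {"id": "summary", "needs": ["readme", "stars"]},
]

def execution_order(nodes):
    """Return node ids so that every node comes after all of its parents."""
    graph = {n["id"]: set(n.get("needs", [])) for n in nodes}
    # static_order() raises graphlib.CycleError if the edges form a cycle.
    return list(TopologicalSorter(graph).static_order())

order = execution_order(nodes)
print(order)
```

Declaration order does not matter here: search is listed second but is ordered first because nothing else can start before it.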

When to reach for a composition

For an emergent workflow where the LLM picks tools as it goes, use agents:. For a single tool call, use tools:.

Running a composition

mcptest run executes every compositions: block in the suite and gates the exit code on each composition's deterministic expect: assertions. A composition declaring cases: runs once per row, and each case is reported as its own pass/fail line.

mcptest run --config examples/composition-full.yml
composition [PASS] render-readme-with-auth-check [common-query]
composition [PASS] render-readme-with-auth-check [empty-query]
ran 2 composition(s): 2 passed, 0 failed

The judged eval: metrics defer when no model provider is configured, the same way a key-free evals: run defers. The expect: targets address a trace envelope: composition.ran, output.<path>, nodes.<id>.status, nodes.<id>.output, and cost.total.
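To make the dotted targets concrete, here is a minimal sketch of how a path such as nodes.readme.status could be walked through a trace. The trace dictionary and the resolve_target helper are assumptions for illustration; the real envelope may carry more fields.

```python
def resolve_target(trace, target):
    """Walk a dotted path such as 'nodes.readme.status' through nested dicts."""
    value = trace
    for key in target.split("."):
        if not isinstance(value, dict) or key not in value:
            raise KeyError(f"target {target!r} has no segment {key!r}")
        value = value[key]
    return value

# Hypothetical trace envelope, limited to the targets named above.
trace = {
    "composition": {"ran": True},
    "output": {"body": "# mcptest"},
    "nodes": {"readme": {"status": "ok", "output": {"body": "# mcptest"}}},
    "cost": {"total": 0.0},
}

print(resolve_target(trace, "nodes.readme.status"))
print(resolve_target(trace, "output.body"))
```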

Anatomy of a composition

compositions:
  - name: render-top-readme
    description: Search for a package and fetch its README in one DAG.
    nodes:
      - id: search
        tool: search_packages
        args:
          query: "mcp"
          limit: 1
      - id: readme
        needs: [search]
        tool: get_readme
        args:
          package: "${search.top_hit.name}"
    output: readme
    expect:
      - target: "composition.ran"
        matcher:
          exact: true
      - target: "nodes.readme.status"
        matcher:
          exact: "ok"

Key          Meaning
name         Human-readable composition name surfaced to the reporter. Required.
description  One-line description for mcptest list. Optional.
nodes        Composition nodes in declaration order. Execution order is computed from needs: edges, not from this list.
output       Composition's combined output. Either a node id (simple form) or { assemble: <jq> } that fuses several nodes into one shape. When omitted, the executor uses the last node in topological order. See Output assembly.
expect       Deterministic assertions evaluated against the combined output and the trace after the run.
cases        Optional dataset that runs the composition once per row. Each row binds ${var.X} substitutions in node args. See Cases.
budget       Optional hard caps on total cost, tokens, and duration. See Budget.
eval         Optional judged-metric block (rubric, mcp_task_completion, mcp_use, argument_correctness, plan_quality). Same surface as agents[].eval. See Judged output.

Each node has the following shape:

Key              Meaning
id               Stable identifier. Must be unique within the composition; referenced by other nodes in needs: and in ${id.field} templates. Required.
needs            Parents whose outputs this node consumes. A reference whose id is not in this list is a load-time error. Optional, defaults to [].
server           Server key resolved through the suite's servers: map. Mandatory when the suite declares more than one server.
tool             MCP tool name. Required.
args             Arguments forwarded to the tool. ${id.field} substitutes a field from a parent node's output; ${var.X} substitutes a case or CLI variable.
transform        Optional jq program applied to the node's output before downstream nodes and the trace see it.
when             Optional jq predicate evaluated before dispatch. A falsy result skips the node. See Control flow.
for_each         Optional fan-out template ${parent} or ${parent[*]}. The node runs once per element of parent's output array, with ${item} substituting the current element. See Fan-out.
max_concurrency  Per-node concurrency cap for for_each. Defaults to 1 (sequential).
mock             Optional inline fixture (mock: { result: ... }) that bypasses the live dispatch. See Determinism.

Data flow

References inside args: are resolved at dispatch time from the parent nodes' recorded outputs. Two whole-string-template behaviors matter:

Three template namespaces are reserved and do not count as DAG edges: ${var.X} (cases), ${item} (fan-out), and ${env.X} (env interpolation).

The runner passes these through unchanged so the stage that owns each namespace can resolve it later in the pipeline.

Per-node transform

A node may carry transform.jq: <program> to reshape its output before downstream nodes and the trace see it. The jq program is compiled at load (so a typo fails the suite immediately), and applied to the raw tool result before the trace records the node's output and before any downstream ${id.field} resolves.

nodes:
  - id: search
    tool: search_packages
    args: { query: "mcp" }
    transform:
      jq: ".result.hits[0] | { name, score }"
  - id: readme
    needs: [search]
    tool: get_readme
    args:
      package: "${search.name}"

Here the search node's raw envelope ({"result": {"hits": [...]}}) is reshaped to {"name": ..., "score": ...} before the downstream ${search.name} reference resolves.
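The ordering described above (transform first, reference resolution second) can be sketched in plain Python. The transform function is a hand-written stand-in for the jq program, and lookup and resolve are hypothetical helpers; the sample hit data is made up for the example.

```python
import re

def transform(raw):
    """Plain-Python stand-in for the jq program '.result.hits[0] | { name, score }'."""
    hit = raw["result"]["hits"][0]
    return {"name": hit["name"], "score": hit["score"]}

def lookup(outputs, ref):
    """Resolve a reference such as 'search.name' against outputs keyed by node id."""
    node_id, *path = ref.split(".")
    value = outputs[node_id]
    for key in path:
        value = value[key]
    return value

def resolve(template, outputs):
    """Substitute every ${id.field} occurrence in a string argument."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: str(lookup(outputs, m.group(1))), template)

# Made-up raw tool result; only the transformed shape is recorded for the node.
raw = {"result": {"hits": [{"name": "mcptest", "score": 0.9, "downloads": 120}]}}
outputs = {"search": transform(raw)}
package = resolve("${search.name}", outputs)
print(outputs["search"], package)
```

Because the recorded output is already reshaped, a reference such as ${search.result.hits} would no longer resolve in this model.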

Control flow

A node may carry when: <jq-predicate> to gate its dispatch on a runtime decision. The predicate is compiled at load (so a typo fails the suite immediately) and evaluated against a context built from the node's parent outputs:

nodes:
  - id: search
    tool: search_packages
    args: { query: "mcp" }
  - id: readme
    needs: [search]
    when: ".parents.search.count > 0"
    tool: get_readme
    args:
      package: "${search.top_hit.name}"

The context the predicate sees is {parents: {<id>: <output>, ...}}, so a predicate references parent outputs as .parents.<id>.<field>.

Truthiness

The runner follows jq's truthiness rule: every value is truthy except false and null. Numbers (including 0), empty strings, empty arrays, and empty objects are all truthy. A common pitfall is a bare .parents.search.count predicate: an absent count field yields null, which is falsy, while a count of 0 is truthy. Compare explicitly with (.parents.search.count // 0) > 0 so the default is spelled out.
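The rule is small enough to model directly. The sketch below mirrors jq's truthiness in Python, where None stands for null; jq_truthy and count_gate are hypothetical names used only for this illustration.

```python
def jq_truthy(value):
    """jq rule: everything is truthy except false and null (None here)."""
    return value is not None and value is not False

def count_gate(parents):
    """Python equivalent of the predicate '(.parents.search.count // 0) > 0'."""
    count = parents.get("search", {}).get("count")
    # jq's // operator falls back to the right-hand side on false or null.
    return (count if jq_truthy(count) else 0) > 0

for sample in (0, "", [], {}, None, False):
    print(repr(sample), jq_truthy(sample))
```

Note the contrast with Python's own truthiness, where 0, "", [], and {} are all falsy.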

Skip propagation

When a node skips (when returned falsy) or fails (tools/call returned an error), every descendant whose needs: list includes that node also skips. The descendant appears in the trace with status: skipped and output: null. A test that needs an explicit "this branch did not run" check can address nodes.<id>.status == "skipped".

This is the only propagation rule v1 ships: every needs: entry is treated as required. Marking a parent optional (so a child runs with a documented default when the parent skipped) is a follow-up.
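A minimal model of this rule, assuming each node's own result is already known: walk the graph in topological order and mark a node skipped as soon as any parent did not finish with ok. The propagate helper and the status strings other than ok and skipped are assumptions.

```python
from graphlib import TopologicalSorter

def propagate(nodes, own_status):
    """Return final statuses; own_status holds per-node results ('ok' by default)."""
    graph = {n["id"]: set(n.get("needs", [])) for n in nodes}
    final = {}
    for node_id in TopologicalSorter(graph).static_order():
        if any(final[parent] != "ok" for parent in graph[node_id]):
            # Every needs: entry is required, so one bad parent skips the child.
            final[node_id] = "skipped"
        else:
            final[node_id] = own_status.get(node_id, "ok")
    return final

nodes = [
    {"id": "search"},
    {"id": "readme", "needs": ["search"]},
    {"id": "stars", "needs": ["search"]},
    {"id": "summary", "needs": ["readme"]},
]
print(propagate(nodes, {"readme": "skipped"}))
```

In this run stars still executes, because the skip only flows along needs: edges that pass through readme.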

Fan-out

A node may carry for_each: "${parent}" (or ${parent[*]}) to fan out into one sub-call per element of the parent's output array. The two forms are equivalent: both name a parent whose output is an array and run the node body once per element. The [*] form is sugar that mirrors how a downstream node will later reference ${id[*].field} to collect across the fan-out.

nodes:
  - id: search
    tool: search_packages
    args: { query: "mcp", limit: 3 }
    transform:
      jq: ".result.hits"
  - id: enrich
    needs: [search]
    for_each: "${search}"
    tool: fetch_metadata
    args:
      name: "${item.name}"

Inside the sub-call's args, ${item} substitutes the current element. The node's trace records the per-iteration outputs as a single array (input order, not completion order):

"nodes": {
  "enrich": {
    "id": "enrich",
    "status": "ok",
    "output": [{"meta": "first"}, {"meta": "second"}, {"meta": "third"}]
  }
}
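The fan-out behavior can be approximated with a short sequential loop. In this sketch fake_fetch_metadata is a hypothetical stand-in for the real tool, and the ${item} matching is an assumed simplification that only handles whole-string templates.

```python
import re

# Matches a whole-string ${item} or ${item.a.b} template (assumed syntax subset).
ITEM = re.compile(r"^\$\{item((?:\.\w+)*)\}$")

def resolve_item(value, item):
    """Replace a whole-string ${item...} template with the current element."""
    match = ITEM.match(value) if isinstance(value, str) else None
    if match is None:
        return value
    for key in filter(None, match.group(1).split(".")):
        item = item[key]
    return item

def fan_out(parent_output, args_template, call):
    """Run call once per element, one at a time, collecting outputs in input order."""
    outputs = []
    for item in parent_output:
        args = {k: resolve_item(v, item) for k, v in args_template.items()}
        outputs.append(call(args))
    return outputs

def fake_fetch_metadata(args):
    """Hypothetical tool: echoes the requested name back as metadata."""
    return {"meta": args["name"]}

hits = [{"name": "first"}, {"name": "second"}, {"name": "third"}]
enrich_output = fan_out(hits, {"name": "${item.name}"}, fake_fetch_metadata)
print(enrich_output)
```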

Edge cases

Concurrency

The per-node max_concurrency: field bounds how many sub-calls may dispatch in parallel. The default is 1 (sequential) so the runner never silently multiplies wall-clock load on a user's test server. The parallel scheduler that honors a max_concurrency > 1 is a deliberate follow-up; today the field validates and is recorded on the resolved node but the executor runs iterations one at a time.

Cases

A composition may carry a cases: dataset so the same DAG runs once per row against a different ${var.X} binding. Each row is a named bundle of variables, an optional golden value, and optional per-case assertion overrides.

compositions:
  - name: parametric
    nodes:
      - id: search
        tool: search_packages
        args:
          q: "${var.query}"
          limit: "${var.limit}"
    output: search
    cases:
      - name: short
        vars: { query: "mcp", limit: 1 }
      - name: long
        vars: { query: "test framework", limit: 10 }
      - name: golden
        vars: { query: "rubric", limit: 5 }
        golden:
          hits: []
        expect:
          - target: "nodes.search.output.hits"
            matcher: { exact: [] }

The runner produces one trace per case. Inside node args, ${var.<key>} and ${var.<key>.<field>} resolve against the case's variable map; everything else (${id.field}, ${item}, ${env.X}) keeps its usual semantics. ${var.X} outside a case context (no dataset declared) passes through unchanged so a single-run composition is backwards-compatible.
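The binding rules in the paragraph above can be sketched as follows. bind_vars and dig are hypothetical helpers. Returning the raw value for a whole-string template, so that limit stays a number, is an assumption of this sketch and not something the text above confirms.

```python
import re

VAR = re.compile(r"\$\{var\.([\w.]+)\}")

def dig(mapping, dotted):
    """Follow 'key.field' through nested dicts."""
    value = mapping
    for key in dotted.split("."):
        value = value[key]
    return value

def bind_vars(args, case_vars=None):
    """Resolve ${var.X} in args; with no case context, leave args untouched."""
    def one(value):
        if not isinstance(value, str) or case_vars is None:
            return value  # pass-through keeps single-run compositions working
        whole = VAR.fullmatch(value)
        if whole:
            return dig(case_vars, whole.group(1))  # assumed: keep the raw type
        return VAR.sub(lambda m: str(dig(case_vars, m.group(1))), value)
    return {key: one(value) for key, value in args.items()}

args = {"q": "${var.query}", "limit": "${var.limit}"}
cases = [
    {"name": "short", "vars": {"query": "mcp", "limit": 1}},
    {"name": "long", "vars": {"query": "test framework", "limit": 10}},
]
bound = {case["name"]: bind_vars(args, case["vars"]) for case in cases}
print(bound)
```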

Planned follow-up: filtering and aggregation

mcptest run --case <name> for selecting one row and the per-case aggregation in the reporter are the next follow-ups. Until they ship, every case row runs and the trace exposes the per-case inputs and outputs for downstream reporting tools to aggregate.

Output assembly

The composition's combined output rides on the trace as output.*. Two shapes are supported:

# Single-node form: select one node's output verbatim.
output: readme
# Assembly form: fuse several nodes into one shape via jq.
output:
  assemble: "{ stars: .nodes.stars.count, release: .nodes.latest.tag }"

The assembly program runs at the end of the run against a context shaped {nodes: {<id>: <output>, ...}}, so a program can reach into any node's recorded output. The result becomes output.* for the assertion pipeline, so a downstream expect block can address output.stars, output.release, and so on against the assembled shape.

The assembly jq is compiled at load (so a typo fails the suite immediately) and re-runs at the end of every run. When omitted, the executor falls back to "use the last node in topological order".
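The three output modes (omitted, node id, assemble) reduce to a small selection function. In this sketch a Python callable stands in for the jq assembly program, and combined_output plus the sample outputs are hypothetical.

```python
def combined_output(outputs, topo_order, output=None):
    """Pick the composition output from recorded node outputs."""
    if output is None:
        # Fallback: the last node in topological order.
        return outputs[topo_order[-1]]
    if isinstance(output, str):
        # Single-node form: that node's output verbatim.
        return outputs[output]
    # Assembly form: the program sees {nodes: {<id>: <output>, ...}}.
    return output["assemble"]({"nodes": outputs})

# Stand-in for "{ stars: .nodes.stars.count, release: .nodes.latest.tag }".
def assemble(context):
    nodes = context["nodes"]
    return {"stars": nodes["stars"]["count"], "release": nodes["latest"]["tag"]}

outputs = {"stars": {"count": 42}, "latest": {"tag": "v1.2.0"}}
order = ["stars", "latest"]
print(combined_output(outputs, order, {"assemble": assemble}))
```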

LLM reduce node

The assembly form is the simple, deterministic option. A composition can also reduce several inputs through an LLM by adding a tool node that consumes the upstream branches and writes the combined output. That is the same pattern as any LLM-backed tool, exposed over MCP; nothing in the composition surface treats it specially. Wire the reducer like any other node, set its needs: to the branches it fuses, and either point output: <reducer> at it or hand the work to output.assemble: if the reduction is shape-only.

Judged output

A composition can carry the same eval: block agent tests use, so the combined output and trace are graded by the LLM-jury / rubric pipeline. Each entry produces its own pass/fail row in the report and gates the run exit code; with no LLM provider set, the metrics defer cleanly so a key-free CI stays green.

compositions:
  - name: judged
    nodes:
      - id: search
        tool: search_packages
        args: { query: "${var.query}" }
    output: search
    eval:
      mcp_task_completion: { threshold: 0.8 }
      mcp_use: { threshold: 0.7 }
      rubric:
        criteria:
          - { name: "covers query", weight: 1.0 }

The supported entries are the same five built-ins agent tests use: rubric (flat criteria: or a decision tree:), mcp_task_completion, mcp_use, argument_correctness, and plan_quality. The composition runner produces the combined output via Output assembly and feeds it, together with the per-node trace, into the same scoring path agent runs use; the report keeps a single shape across agent and composition rows so a CI dashboard does not need to know the difference.

Planned follow-up: per-case eval

When a composition declares both cases: and eval:, every case will get its own per-row eval; the aggregate view lands with the case-aggregation work in the reporter. Today the resolver carries the typed metric list on every case so the eval engine can light up without further YAML changes.

Budget

A composition multiplies spend fast: cases × runs × fan-out × nodes. The budget: block is a first-class gate that aborts the run the first time the accumulated total crosses a cap, so an overrun is caught mid-run instead of on the invoice. Every field is optional; an unset field disables that dimension's gate.

compositions:
  - name: gated
    budget:
      max_cost_usd: 0.50
      max_tokens: 20000
    nodes:
      - id: search
        tool: search_packages
        max_cost_per_call_usd: 0.05
        max_tokens_per_call: 1000

The trace records every node's cost_usd, tokens, and model attribution (populated by LLM-backed adapters; tool calls with no attribution contribute zero). The top-level composition.cost_total and composition.tokens_total are the sums, addressable as cost.total and tokens.total in assertions:

expect:
  - target: "cost.total"
    matcher: { lte: 0.50 }
  - target: "tokens.total"
    matcher: { lte: 20000 }

When any cap fires, composition.budget_tripped carries the name of the cap that fired (max_cost_usd, max_tokens, or max_duration_ms) and the verdict fails. The reporter surfaces the cap name so a CI gate fails with a clear pointer at spend.
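A rough model of the gate: accumulate per-node cost and tokens, and stop at the first node that pushes a total past its cap. This sketch assumes a cap trips only when the total strictly exceeds it, and it leaves out max_duration_ms for brevity; run_with_budget and the sample numbers are hypothetical.

```python
def first_tripped_cap(budget, cost_total, tokens_total):
    """Return the name of the first exceeded cap, or None. Unset caps never trip."""
    totals = {"max_cost_usd": cost_total, "max_tokens": tokens_total}
    for cap, total in totals.items():
        if budget.get(cap) is not None and total > budget[cap]:
            return cap
    return None

def run_with_budget(node_costs, budget):
    """node_costs: (node_id, cost_usd, tokens) tuples in execution order."""
    cost_total, tokens_total, ran, tripped = 0.0, 0, [], None
    for node_id, cost_usd, tokens in node_costs:
        ran.append(node_id)
        cost_total += cost_usd
        tokens_total += tokens
        tripped = first_tripped_cap(budget, cost_total, tokens_total)
        if tripped is not None:
            break  # abort the rest of the run
    return {"ran": ran, "cost_total": cost_total,
            "tokens_total": tokens_total, "budget_tripped": tripped}

costs = [("search", 0.2, 500), ("readme", 0.4, 800), ("summary", 0.1, 300)]
result = run_with_budget(costs, {"max_cost_usd": 0.50, "max_tokens": 20000})
print(result["budget_tripped"], result["ran"])
```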

Planned follow-up: projection, baselines, attribution

The runtime trace surface (per-node cost_usd / tokens / model, top-level cost_total / tokens_total, and budget_tripped) is the foundation those features all read from, so they land as additive follow-ups without breaking the contract shipped here.

Determinism

Determinism is per-node, not per-composition. A composition can mix live and frozen nodes so the expensive or flaky parts (the LLM, a paid API) run once against a fixture while the cheap and stable parts still exercise the live server.

The simplest form is mock: { result: ... }. When a node declares a mock, the executor records the fixture as the node's output without touching the transport. The status stays ok, the duration_ms is reported as 0, and downstream ${id.field} references resolve against the mock value exactly as they would against a real response.

nodes:
  - id: search
    tool: search_packages
    args: { query: "mcp" }
    mock:
      result:
        top_hit:
          name: "mcptest"
          score: 0.9
  - id: readme
    needs: [search]
    tool: get_readme
    args:
      package: "${search.top_hit.name}"
    mock:
      result:
        body: "# mcptest"

The mock value is reshaped by the node's transform.jq: when one is declared, so a frozen node and a live node are interchangeable from downstream's point of view. A mock-only suite runs entirely offline.
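The mock path described in this section can be modeled in a few lines. run_node, no_transport, and the callable transform are hypothetical; the point of the sketch is that a mocked node never reaches the dispatch function and still passes through the transform.

```python
import time

def run_node(node, dispatch):
    """Record a node's trace entry, using the mock fixture when one is declared."""
    if "mock" in node:
        raw, duration_ms = node["mock"]["result"], 0  # transport untouched
    else:
        started = time.monotonic()
        raw = dispatch(node["tool"], node.get("args", {}))
        duration_ms = int((time.monotonic() - started) * 1000)
    transform = node.get("transform")  # a callable standing in for transform.jq
    output = transform(raw) if transform is not None else raw
    return {"id": node["id"], "status": "ok",
            "duration_ms": duration_ms, "output": output}

def no_transport(tool, args):
    raise RuntimeError("transport must not be touched for a mocked node")

search = {
    "id": "search",
    "tool": "search_packages",
    "args": {"query": "mcp"},
    "mock": {"result": {"top_hit": {"name": "mcptest", "score": 0.9}}},
}
entry = run_node(search, no_transport)
print(entry)
```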

Planned follow-up: freeze + cassette

freeze: golden and freeze: record for capture-once-replay-forever behavior are the next determinism layer. The full-graph cassette extends the existing mcptest-cassette format with a node-keyed layout so the same suite that was authored against the live transport today replays byte-stable tomorrow. Until that ships, the mocked suite plus the deterministic DAG cover the "freeze the expensive node" use case end-to-end.

Planned follow-up: bounded loops

loop: / until: / max: for a bounded "re-run this sub-path until the predicate holds" construct is on the roadmap. The composition budget, the per-loop budget, and the loop-iteration trace are part of that scope and ship together so a loop never quietly blows the budget. Until then, the deterministic DAG plus when: cover the common "run this branch only if the previous step succeeded" case.

Edges and references

Edges are explicit: each node lists its parents in needs:, and a ${id.field} reference in args: whose id is not in needs: is a load-time error. The intent is that a broken graph fails at load, not at 2am. Conversely, a needs: entry that is never referenced is a warning, not an error: the executor still treats it as a control-only ordering edge, but the lint asks whether that was intentional.
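The error and warning rules above amount to two set differences per node. The sketch below is a hypothetical lint, not mcptest's loader; treating var, item, and env as the reserved namespaces is an assumption drawn from the templates this page mentions.

```python
import re

REF = re.compile(r"\$\{(\w+)")       # captures the leading name of each template
RESERVED = {"var", "item", "env"}    # assumed reserved namespaces, never edges

def check_node(node):
    """Return (errors, warnings): undeclared references and unreferenced needs."""
    needs = set(node.get("needs", []))
    templates = list(node.get("args", {}).values())
    templates.append(node.get("for_each", ""))
    referenced = set()
    for value in templates:
        if isinstance(value, str):
            referenced |= {name for name in REF.findall(value) if name not in RESERVED}
    errors = sorted(referenced - needs)    # reference without an edge: load error
    warnings = sorted(needs - referenced)  # edge without a reference: lint warning
    return errors, warnings

readme = {"id": "readme", "needs": ["search"], "tool": "get_readme",
          "args": {"package": "${search.top_hit.name}"}}
print(check_node(readme))
```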

Three template namespaces are reserved and do not count as DAG edges: ${var.X} (cases), ${item} (fan-out), and ${env.X} (env interpolation).

Trace shape

After every node has either run or been marked skipped, the executor emits a trace the assertion pipeline can address with dotted paths.

Load-time validation

The loader rejects a composition for any of the following reasons:

Warnings (non-blocking) include:

Worked example

A runnable example ships at examples/composition-pipeline.yml.