Metamorphic testing

Status: implemented behind the preview schema flag. Tracked as epic WOR-1236 and child WOR-1237. The relation engine and the runner that issues the transformed calls both ship; the metamorphic block is marked preview in the schema while the surface settles.

A matcher in mcptest answers one question: given these arguments, is this response correct? To ask it you have to know the correct response and write it down. That works for a tool with a stable answer. It does not work for a search, a summarizer, a recommender, or a ranking tool, where the right output drifts with the index, the model, and the day. Those tools get left untested, or pushed onto an llm-judge that costs money and is not deterministic.

Metamorphic testing removes the requirement. Instead of a golden output, you assert a relation between calls: a property that must hold between the output of one call and the output of a related call, whatever the actual values are. A search that returns different results when the same filters are passed in a different order has a bug, and you can say so without knowing what the results should be. This is the standard answer to the test-oracle problem, applied to REST APIs by recent work (Multi-Agent LLM-based Metamorphic Testing for REST APIs, arXiv:2605.28321). MCP tool schemas look like OpenAPI operations, so the relations carry over.

The default comparison runs no model.

The relation catalog

Each relation pairs an input transform with the property the output must keep. All of these are deterministic.

idempotent: call the same arguments twice; the two results must be equal.
arg_order_insensitive: permute object keys, or the order of an array-of-filters argument; the result must not change.
noop_filter: add a filter that selects everything; the result set must not shrink.
subset_under_filter: add a real filter; the result must be a subset of the unfiltered result.
case_insensitive: lowercase the query; the result must not change.
whitespace_insensitive: collapse or pad surrounding whitespace; the result must not change.
monotone: raise a limit-style argument; the result set must not shrink.
duplicate_insensitive: append a duplicate of an element already in an array argument; a set-valued argument must ignore the redundant input, so the result must not change.
symmetric_args: swap two named arguments; an operation that is commutative in those two arguments (for example a "between A and B" search) must return an equal result.
limit_superset: raise a limit-style argument; the broadened result must contain every item the base result had. This is stricter than monotone, which checks only that the count did not shrink: a tool that returns more rows overall but silently drops one of the originals fails limit_superset while passing monotone.
substitution_insensitive: replace a documented substring with its synonym in a string argument ({ arg, from, to }); the tool must treat the paraphrase as equivalent and return an unchanged result. This is the soft oracle for doc-vs-behavior conformance: if the tool description promises two terms mean the same thing, this catches when the implementation disagrees.

A relation needs two runs and a comparison, nothing else. Equality reuses the same snapshot normalization pass the cassette path uses, so two recordings diff cleanly and an incidental field like a timestamp does not trip a false violation.
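As a rough illustration of the comparison step, the sketch below shows equality after normalization next to the count-based and set-based properties. The VOLATILE_KEYS set, the normalize helper, and the result shape (an items list of objects with an id) are assumptions made for this example; they are not mcptest's actual normalization pass or response format.

```python
from typing import Any

# Hypothetical: fields treated as incidental. The real normalization pass is
# whatever the cassette path uses; this set exists only for illustration.
VOLATILE_KEYS = {"timestamp", "request_id"}


def normalize(value: Any) -> Any:
    """Recursively drop volatile keys so incidental fields do not affect equality."""
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items() if k not in VOLATILE_KEYS}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    return value


def equal(base: Any, other: Any) -> bool:
    """Equality property used by idempotent, arg_order_insensitive, and similar."""
    return normalize(base) == normalize(other)


def ids(result: dict) -> set:
    """Assumed result shape: {"items": [{"id": ...}, ...]}."""
    return {item["id"] for item in result["items"]}


def monotone_holds(base: dict, raised: dict) -> bool:
    """monotone: the result set must not shrink (count only)."""
    return len(raised["items"]) >= len(base["items"])


def limit_superset_holds(base: dict, raised: dict) -> bool:
    """limit_superset: every base item must still be present."""
    return ids(base) <= ids(raised)


def subset_under_filter_holds(base: dict, filtered: dict) -> bool:
    """subset_under_filter: the filtered result must be a subset of the base."""
    return ids(filtered) <= ids(base)
```

With a base result holding ids 1 and 2 and a raised-limit result holding ids 1, 3, and 4, monotone_holds passes (three items against two) while limit_superset_holds fails (id 2 was dropped), which is the distinction the catalog draws between the two relations.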

Named presets

Picking relations for a new tool is the hard part: which entries in the catalog above even apply? Two built-in presets bundle the relations that need no per-tool argument, so a tool with no stable golden output is testable out of the box. Name them under the presets: key, instead of (or alongside) the relations: key:

    metamorphic:
      presets: [universal, text]

universal = idempotent + arg_order_insensitive. These assume nothing about argument types or names and hold for any deterministic, key-order-insensitive tool, so they are the safe default for any tool.
text = case_insensitive + whitespace_insensitive. For a tool that normalizes textual input (search, lookup, NLU). Not universal: a tool that is deliberately case- or whitespace-sensitive will (correctly) violate these, so this bundle is opt-in by name rather than folded into universal.

Presets expand to the concrete relations above and merge with any explicit relations:, de-duplicated, so the report and the metamorphic.* targets are identical to having listed those relations by hand. The argument-naming relations (noop_filter, monotone, subset_under_filter, and the rest) stay explicit: they need a tool-specific argument, so no preset can guess them. The preset catalog is drawn from the metamorphic-testing survey (arXiv:2511.02108), reduced to the structurally applicable, parameter-free relations.
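The expand-and-merge step can be sketched roughly as follows. The preset contents come from the definitions above; the expand function, the ordering (preset relations first, then explicit ones), and the de-duplication key are assumptions for illustration, not the actual implementation.

```python
import json
from typing import Any, Union

# A relation is either a bare name or a single-key mapping with its arguments.
Relation = Union[str, dict[str, Any]]

# Preset contents as described in this section.
PRESETS: dict[str, list[str]] = {
    "universal": ["idempotent", "arg_order_insensitive"],
    "text": ["case_insensitive", "whitespace_insensitive"],
}


def expand(presets: list[str], relations: list[Relation]) -> list[Relation]:
    """Expand presets, append explicit relations, and drop duplicates.

    Assumption: first occurrence wins and order is otherwise preserved.
    An unknown preset name raises KeyError.
    """
    candidates: list[Relation] = [name for p in presets for name in PRESETS[p]]
    candidates.extend(relations)

    merged: list[Relation] = []
    seen: set[str] = set()
    for rel in candidates:
        # Parameterized relations are keyed by their full content, so two
        # noop_filter entries with different arguments are both kept.
        key = rel if isinstance(rel, str) else json.dumps(rel, sort_keys=True)
        if key not in seen:
            seen.add(key)
            merged.append(rel)
    return merged
```

Under these assumptions, expand(["universal"], ["idempotent"]) yields the same list as writing idempotent and arg_order_insensitive by hand, which is why the report and targets would not differ.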

A worked example

The base call searches for anthropic with two filters. Four relations are asserted against it.

    tools:
      - name: search is order-insensitive and idempotent
        tool: search
        args: { query: "anthropic", filters: ["lang:en", "type:doc"] }
        metamorphic:
          relations:
            - idempotent
            - arg_order_insensitive
            - { noop_filter: { arg: filters, value: "*" } }
            - { monotone: { arg: limit } }

The runner makes the base call once, then makes one transformed call per relation: the same call again (idempotent), the call with filters reversed and argument keys permuted (arg_order_insensitive), the call with a select-everything filter appended (noop_filter), and the call with limit raised (monotone). Each transformed result is compared against the base under the relation's property. Any deterministic relation that fails is a violation.
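The loop described above might look like the following minimal sketch. The run function, the RELATIONS table, and both stand-in search tools are hypothetical and exist only to show the shape of one base call followed by one transformed call per relation; the real runner issues MCP tool calls.

```python
from typing import Any, Callable

Args = dict[str, Any]
Tool = Callable[[Args], list[str]]


# Input transforms. Key permutation is omitted here because a Python dict
# passed to a function carries no meaningful key order; only the filters
# array is reordered.
def same_call(args: Args) -> Args:
    return dict(args)


def reverse_filters(args: Args) -> Args:
    return {**args, "filters": list(reversed(args["filters"]))}


def append_noop_filter(args: Args) -> Args:
    return {**args, "filters": [*args["filters"], "*"]}


# Each relation pairs a transform with the property the output must keep.
RELATIONS: dict[str, tuple[Callable[[Args], Args], Callable[[list, list], bool]]] = {
    "idempotent": (same_call, lambda base, out: base == out),
    "arg_order_insensitive": (reverse_filters, lambda base, out: base == out),
    "noop_filter": (append_noop_filter, lambda base, out: len(out) >= len(base)),
}


def run(tool: Tool, args: Args, names: list[str]) -> list[str]:
    """Make the base call once, then one transformed call per relation."""
    base = tool(args)
    violations = []
    for name in names:
        transform, holds = RELATIONS[name]
        if not holds(base, tool(transform(args))):
            violations.append(name)
    return violations


# Stand-in corpus and tools (hypothetical).
DOCS = [("a", {"lang:en", "type:doc"}), ("b", {"lang:en"}), ("c", {"type:doc"})]


def correct_search(args: Args) -> list[str]:
    """Applies every filter; "*" selects everything."""
    return [d for d, tags in DOCS if all(f == "*" or f in tags for f in args["filters"])]


def buggy_search(args: Args) -> list[str]:
    """Bug: only the first filter is applied, so filter order changes the result."""
    first = args["filters"][0]
    return [d for d, tags in DOCS if first == "*" or first in tags]


ARGS: Args = {"query": "anthropic", "filters": ["lang:en", "type:doc"]}
```

Running the three relations against buggy_search reports arg_order_insensitive as the only violation, and nothing in the check needed to know which documents the search should have returned.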

The gate

The gate fails the test on any violation of a deterministic relation. A violation is a concrete, reproducible disagreement (the same query returned a different result set when nothing that should matter changed), so it is safe to gate on without a flake budget.

Assertable targets

The check exposes three targets. The names are exact.

Target	Meaning
`metamorphic.relations_checked`	Count of relations evaluated.
`metamorphic.violations`	Count of relations that failed.
`metamorphic.gate_passed`	1 when no deterministic relation failed, 0 otherwise.

Write an explicit expect: to assert a target directly, or omit it to get the default gate (fail on any deterministic violation).

    metamorphic:
      relations: [idempotent, arg_order_insensitive]
      expect:
        - target: metamorphic.violations
          matcher: { schema: { maximum: 0 } }
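To make the three targets concrete, here is a small sketch of how they could be derived from per-relation outcomes. The Outcome type and the targets function are hypothetical, and the sketch assumes every outcome passed in comes from a deterministic relation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical record of one evaluated relation."""
    relation: str
    passed: bool


def targets(outcomes: list[Outcome]) -> dict[str, int]:
    """Derive the three assertable targets from deterministic outcomes."""
    violations = sum(1 for o in outcomes if not o.passed)
    return {
        "metamorphic.relations_checked": len(outcomes),
        "metamorphic.violations": violations,
        "metamorphic.gate_passed": 1 if violations == 0 else 0,
    }
```

An expect: of maximum: 0 on metamorphic.violations then expresses the same condition as the default gate, which is why omitting expect: gives the same pass/fail decision in this sketch.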

The optional model-assisted relation

One relation, paraphrase_invariant, needs a model to generate a benign paraphrase of the input (a classification must not change when the prompt is reworded). It follows the same rule as the narrative-trace check:

The default is objective. The built-in relations run no model. You opt into the model path explicitly with paraphrase_invariant.
The CI gate never calls a model. The model-assisted relation is advisory signal in the report; it does not change what the gate decides, and it never makes a green run depend on a model call.
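The separation between the gate and the advisory signal can be sketched as below. The Outcome type, its deterministic flag, and both functions are hypothetical names used only for illustration; how the report actually stores advisory results is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical record of one evaluated relation."""
    relation: str
    passed: bool
    deterministic: bool = True  # False only for a model-assisted relation


def gate_passed(outcomes: list[Outcome]) -> int:
    """The gate looks at deterministic relations only."""
    failed = any(o.deterministic and not o.passed for o in outcomes)
    return 0 if failed else 1


def advisory_failures(outcomes: list[Outcome]) -> list[str]:
    """Model-assisted failures are surfaced for the report, never gated on."""
    return [o.relation for o in outcomes if not o.deterministic and not o.passed]
```

In this sketch a failing paraphrase_invariant result shows up in advisory_failures while gate_passed still returns 1, so a green run never depends on a model call.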

What it does not do

Metamorphic testing checks that a tool is self-consistent, not that it is good. A search that returns confidently wrong results for every query is consistent under every relation here and passes them all. So this is a floor, not a ceiling. Pair it with a few golden-output tests for the inputs you do know the answer to, and with llm-judge for subjective quality. What it buys you is the long tail: the thousands of inputs you will never write a golden answer for, gated deterministically and for free.