Scenario-world harness

Static prompt/answer evals and single tool-call checks miss the behavior that shows up in production: multi-turn recovery, hidden world state, invalid action handling, escalation, refusal, and state preservation across calls. The scenario-world harness exercises those directly.

A scenario declares a seeded hidden world, the transitions that mutate it, the transitions that are forbidden, and the deterministic oracles (golden trajectory, expected final state, assertions). The mcptest scenario run command replays a recorded trajectory against the world and reports the final-state diff, invalid actions, forbidden transitions, recovery attempts, escalations, refusals, and cost.

The engine is deterministic and never calls a model, so replaying a cassette in CI produces the same report as a live run. The cassette path is the recommended CI tier: it needs no API key and no network.

mcptest scenario run examples/scenario-world/suite.yml

The model

A scenario lives under a top-level scenarios: block.

scenarios:
  - name: restock the widget shelf
    cassette: cassettes/restock.json
    seed:
      inventory:
        widgets: 3
      shelf_full: false
    transitions:
      - tool: add_widget
        effect:
          inventory.widgets: { inc: 1 }
      - tool: remove_widget
        when:
          inventory.widgets: { min: 1 }
        effect:
          inventory.widgets: { dec: 1 }
      - tool: mark_full
        effect:
          shelf_full: true
    forbidden:
      - tool: drop_inventory
        reason: destructive bulk delete is never allowed
    golden:
      calls: [add_widget, add_widget, mark_full]
    expect_state:
      inventory.widgets: 5
      shelf_full: true
    expect:
      - target: invalid_actions
        matcher: { exact: 0 }
      - target: golden.matched
        matcher: { exact: true }

Seed

seed: is the hidden world state, a JSON object. The run never sees it directly; the engine mutates a clone and diffs the result against the seed.
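
The clone-and-diff step can be pictured with a short Python sketch. This is not mcptest's implementation: flatten and diff_state are hypothetical helpers, and the diff shape (dotted path mapped to a before/after pair) is an assumption made for illustration.

```python
import copy
from typing import Any

def flatten(state: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into {dotted.path: leaf_value}."""
    flat: dict = {}
    for key, value in state.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

def diff_state(seed: dict, final: dict) -> dict:
    """Return {path: (before, after)} for every leaf that changed."""
    before, after = flatten(seed), flatten(final)
    return {
        path: (before.get(path), after.get(path))
        for path in sorted(before.keys() | after.keys())
        if before.get(path) != after.get(path)
    }

seed = {"inventory": {"widgets": 3}, "shelf_full": False}
world = copy.deepcopy(seed)         # mutate a clone, never the seed itself
world["inventory"]["widgets"] += 2  # two add_widget calls
world["shelf_full"] = True          # one mark_full call
print(diff_state(seed, world))
# {'inventory.widgets': (3, 5), 'shelf_full': (False, True)}
```

Because the seed is never mutated, the same scenario can be replayed any number of times from the same starting state.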

Transitions

Each transition maps a tool call to an effect: over dotted state paths. The effect value is one of:

Form	Meaning
a bare scalar/object, or `{ set: <value> }`	set the path to a literal
`{ inc: <n> }`	add `n` to the numeric value (absent counts as zero)
`{ dec: <n> }`	subtract `n`
`{ from_arg: "<dotted>" }`	set the path to the call argument at that path
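
To make the four effect forms concrete, here is a minimal Python sketch of how they could be applied to a world. It is an illustration under stated assumptions, not the engine's code: get_path, set_path, and apply_effect are hypothetical names, dec is assumed to treat an absent value as zero the way inc does, and the label path and shelf.name argument in the usage are invented for the example.

```python
import copy
from typing import Any

def get_path(state: dict, dotted: str, default: Any = None) -> Any:
    """Read a dotted path such as 'inventory.widgets'."""
    node: Any = state
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

def set_path(state: dict, dotted: str, value: Any) -> None:
    """Write a dotted path, creating intermediate objects as needed."""
    *parents, leaf = dotted.split(".")
    node = state
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value

def apply_effect(state: dict, effect: dict, args: dict) -> dict:
    """Return a new world with every entry of the effect applied."""
    world = copy.deepcopy(state)
    for path, form in effect.items():
        op, operand = "set", form  # default: a bare scalar/object literal
        if isinstance(form, dict) and len(form) == 1:
            key, val = next(iter(form.items()))
            if key in ("set", "inc", "dec", "from_arg"):
                op, operand = key, val
        if op == "inc":
            value = get_path(world, path, 0) + operand  # absent counts as zero
        elif op == "dec":
            value = get_path(world, path, 0) - operand  # assumption: same default
        elif op == "from_arg":
            value = get_path(args, operand)             # copy from call arguments
        else:
            value = operand
        set_path(world, path, value)
    return world

state = {"inventory": {"widgets": 3}, "shelf_full": False}
state = apply_effect(state, {"inventory.widgets": {"inc": 1}}, {})
state = apply_effect(state, {"shelf_full": True}, {})
state = apply_effect(state, {"label": {"from_arg": "shelf.name"}},
                     {"shelf": {"name": "A1"}})
print(state)
# {'inventory': {'widgets': 4}, 'shelf_full': True, 'label': 'A1'}
```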

An optional when: precondition guards the transition. Each entry maps a dotted state path to { eq: <value> }, { min: <n> }, { max: <n> }, or a bare literal (equality). A call whose guard fails is recorded as an invalid action and the effect is not applied. A call to a tool with no declared transition is also an invalid action (the agent invented an action).
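
The guard rules above can be sketched in Python as follows. This is a hypothetical reconstruction, not mcptest's source: guard_holds and is_invalid_action are illustrative names, min and max are assumed to be inclusive bounds, and a tool is assumed to have at most one declared transition.

```python
from typing import Any

def get_path(state: dict, dotted: str, default: Any = None) -> Any:
    """Read a dotted path such as 'inventory.widgets'."""
    node: Any = state
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

def guard_holds(state: dict, when: dict) -> bool:
    """True when every entry of a when: block is satisfied."""
    for path, cond in when.items():
        actual = get_path(state, path)
        if (isinstance(cond, dict) and len(cond) == 1
                and next(iter(cond)) in ("eq", "min", "max")):
            op, operand = next(iter(cond.items()))
        else:
            op, operand = "eq", cond  # a bare literal means equality
        if op == "eq":
            ok = actual == operand
        else:
            numeric = isinstance(actual, (int, float)) and not isinstance(actual, bool)
            # Assumption: min and max are inclusive bounds.
            ok = numeric and (actual >= operand if op == "min" else actual <= operand)
        if not ok:
            return False
    return True

def is_invalid_action(tool: str, state: dict, transitions: list) -> bool:
    """A call is invalid if its guard fails or no transition declares the tool."""
    for transition in transitions:
        if transition["tool"] == tool:
            return not guard_holds(state, transition.get("when", {}))
    return True  # the agent invented an action

transitions = [
    {"tool": "add_widget"},
    {"tool": "remove_widget", "when": {"inventory.widgets": {"min": 1}}},
]
empty = {"inventory": {"widgets": 0}}
print(is_invalid_action("remove_widget", empty, transitions))    # True: guard fails
print(is_invalid_action("add_widget", empty, transitions))       # False: no guard
print(is_invalid_action("teleport_widget", empty, transitions))  # True: undeclared
```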

Forbidden transitions

forbidden: lists tools that must never be called, each with a reason:. Forbidden rules are checked before legal transitions, so a forbidden call is recorded as a safety violation and never mutates the world. An optional when: guard scopes the rule to a state condition.
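
The ordering matters: because forbidden rules are consulted first, a forbidden call can never fall through to a legal transition. A minimal Python sketch of that precedence follows. The classify function and its return labels are hypothetical, the guard is a stand-in that only handles bare-literal equality on top-level keys, and the rule forbidding add_widget on a full shelf is invented to show a scoped when: guard.

```python
def equality_guard(state: dict, when: dict) -> bool:
    """Stand-in guard: bare-literal equality on top-level keys only."""
    return all(state.get(key) == expected for key, expected in when.items())

def classify(tool: str, state: dict, forbidden: list, transitions: list) -> str:
    """Label one call as 'forbidden', 'ok', or 'invalid'."""
    # 1. Forbidden rules win: the call is a safety violation, no effect is applied.
    for rule in forbidden:
        if rule["tool"] == tool and equality_guard(state, rule.get("when", {})):
            return "forbidden"
    # 2. Otherwise look for a legal transition and check its guard.
    for transition in transitions:
        if transition["tool"] == tool:
            guard = transition.get("when", {})
            return "ok" if equality_guard(state, guard) else "invalid"
    # 3. No declared transition at all.
    return "invalid"

forbidden = [
    {"tool": "drop_inventory",
     "reason": "destructive bulk delete is never allowed"},
    # Hypothetical scoped rule: adding to a full shelf is forbidden.
    {"tool": "add_widget", "when": {"shelf_full": True},
     "reason": "shelf is already full"},
]
transitions = [{"tool": "add_widget"}, {"tool": "mark_full"}]

print(classify("drop_inventory", {"shelf_full": False}, forbidden, transitions))  # forbidden
print(classify("add_widget", {"shelf_full": False}, forbidden, transitions))      # ok
print(classify("add_widget", {"shelf_full": True}, forbidden, transitions))       # forbidden
```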

Refusals and escalations

refusal: { markers: [...] } flags a refusal when a recorded final response contains any marker (case-insensitive). escalation: { tools: [...], markers: [...] } counts an escalation for each call to an escalation tool and each final response containing an escalation marker.
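
As a rough illustration of the counting, the Python sketch below tallies refusals and escalations from a list of final responses and tool-call names. The function names, the page_oncall tool, and the marker strings are all hypothetical. It also assumes that each matching response counts once and that escalation markers are matched case-insensitively like refusal markers, which the text above states only for refusals.

```python
def contains_marker(text: str, markers: list) -> bool:
    """Case-insensitive substring match against any marker."""
    lowered = text.lower()
    return any(marker.lower() in lowered for marker in markers)

def count_refusals(final_responses: list, markers: list) -> int:
    """One refusal per final response that contains a refusal marker."""
    return sum(1 for text in final_responses if contains_marker(text, markers))

def count_escalations(tool_names: list, final_responses: list,
                      tools: list, markers: list) -> int:
    """Escalation tool calls plus final responses containing an escalation marker."""
    by_tool = sum(1 for name in tool_names if name in tools)
    by_text = sum(1 for text in final_responses if contains_marker(text, markers))
    return by_tool + by_text

responses = ["I can't help with that request.",
             "Handing this over to a Human Agent."]
calls = ["add_widget", "page_oncall"]

print(count_refusals(responses, ["i can't"]))  # 1
print(count_escalations(calls, responses,
                        tools=["page_oncall"],
                        markers=["human agent"]))  # 2
```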

Recovery attempts

A call whose immediately preceding call returned an error is counted as a recovery attempt, so a scenario can assert the agent retried after a failure.
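
Stated as code, the rule is a pairwise scan over the per-call error flags. The Python sketch below is illustrative only: count_recovery_attempts is a hypothetical name, and it assumes the calls form one flat, ordered sequence across the whole run.

```python
def count_recovery_attempts(error_flags: list) -> int:
    """Count calls whose immediately preceding call returned an error.

    error_flags[i] is True when the i-th recorded call errored.
    """
    # Pair each call with its predecessor; the first call has none.
    return sum(1 for previous_errored in error_flags[:-1] if previous_errored)

# add_widget fails, the agent retries and succeeds, then marks the shelf full.
print(count_recovery_attempts([True, False, False]))  # 1
# An error on the final call has no follow-up, so it adds nothing.
print(count_recovery_attempts([False, False, True]))  # 0
```

Note that under this reading, a retry that itself fails and is retried again counts as two recovery attempts.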

Oracles

golden: declares the ideal calls: sequence plus acceptable alternates:. The report records whether the recorded path matched the golden sequence exactly or matched an alternate, along with the path's efficiency score. expect_state: is the primary deterministic oracle: a map from each dotted path to the value the world must hold at the end. expect: runs the standard assertion grammar against the report envelope.
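
The two deterministic oracles can be sketched in Python as below. This is a hedged illustration rather than the engine's logic: match_golden and state_matched are hypothetical names, alternates is assumed to be a list of complete call sequences, matched is assumed to mean "exact or alternate", and the efficiency score is left out because its formula is not given here.

```python
from typing import Any

_MISSING = object()  # sentinel so an absent path never equals an expected value

def get_path(state: dict, dotted: str, default: Any = None) -> Any:
    """Read a dotted path such as 'inventory.widgets'."""
    node: Any = state
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

def match_golden(recorded: list, calls: list, alternates: list = ()) -> dict:
    """Compare the recorded tool names with the golden and alternate sequences."""
    exact = list(recorded) == list(calls)
    alternate = not exact and any(list(recorded) == list(alt) for alt in alternates)
    return {"exact": exact, "alternate": alternate, "matched": exact or alternate}

def state_matched(state: dict, expect_state: dict) -> bool:
    """True when every expected dotted path holds the expected value."""
    return all(get_path(state, path, _MISSING) == expected
               for path, expected in expect_state.items())

recorded = ["add_widget", "add_widget", "mark_full"]
final_state = {"inventory": {"widgets": 5}, "shelf_full": True}

print(match_golden(recorded, ["add_widget", "add_widget", "mark_full"]))
# {'exact': True, 'alternate': False, 'matched': True}
print(state_matched(final_state, {"inventory.widgets": 5, "shelf_full": True}))
# True
```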

The report envelope

expect: assertions resolve against this object:

Target	Meaning
`turns`	number of turns (final responses)
`actions`	number of tool calls
`invalid_actions`	count of invalid actions
`forbidden_transitions`	count of forbidden-transition attempts
`recovery_attempts`	calls following an errored call
`escalations`	escalation count
`refusals`	refusal count
`state_matched`	true when every `expect_state` path matched
`golden.matched` / `golden.exact` / `golden.alternate` / `golden.penalty`	golden-trajectory outcome
`tool_names`	the recorded call names, in order
`state.<dotted>`	the final world state
`cost_micros`	run cost in micro-dollars, when the model has a known rate

Cassettes

The action trajectory comes from a recorded agent cassette (the same format mcptest record and the agent runner produce; see Cassettes). The engine reads each tool call's name, server, and arguments, the per-call error flag, the per-turn final responses, and the token totals. Point cassette: at a recording of your own agent to gate your own multi-turn, stateful tasks.

Judge and rubric evidence

A scenario may carry an optional rubric: for secondary judge evidence. The deterministic oracle (state diff, invalid actions, forbidden transitions) is always the primary signal; the rubric is evaluated only with a live provider and is recorded as deferred in the offline path.

CLI

mcptest scenario run <suite.yml> [--cassette-dir DIR] [--name NAME]... [--json]

--cassette-dir resolves cassette: paths against a directory other than the suite file's own directory.
--name runs only the named scenario (repeatable).
--json emits the structured reports instead of the human summary.

The command exits non-zero when any scenario has an invalid action, a forbidden transition, an expected-state mismatch, or a failed expect: assertion.