Scenario-world harness
Static prompt/answer evals and single tool-call checks miss the behavior that shows up in production: multi-turn recovery, hidden world state, invalid action handling, escalation, refusal, and state preservation across calls. The scenario-world harness exercises those directly.
A scenario declares a seeded hidden world, the transitions that mutate it, the transitions that are forbidden, and the deterministic oracles (golden trajectory, expected final state, assertions). mcptest scenario run replays a recorded trajectory against the world and reports the final-state diff, invalid actions, forbidden transitions, recovery attempts, escalations, refusals, and cost.
The engine is deterministic and never calls a model, so the same report comes from a cassette in CI as from a live run. The cassette path is the recommended CI tier: it needs no API key and no network.
mcptest scenario run examples/scenario-world/suite.yml
The model
A scenario lives under a top-level scenarios: block.
scenarios:
- name: restock the widget shelf
cassette: cassettes/restock.json
seed:
inventory:
widgets: 3
shelf_full: false
transitions:
- tool: add_widget
effect:
inventory.widgets: { inc: 1 }
- tool: remove_widget
when:
inventory.widgets: { min: 1 }
effect:
inventory.widgets: { dec: 1 }
- tool: mark_full
effect:
shelf_full: true
forbidden:
- tool: drop_inventory
reason: destructive bulk delete is never allowed
golden:
calls: [add_widget, add_widget, mark_full]
expect_state:
inventory.widgets: 5
shelf_full: true
expect:
- target: invalid_actions
matcher: { exact: 0 }
- target: golden.matched
matcher: { exact: true }
Seed
seed: is the hidden world state, a JSON object. The run never sees it directly; the engine mutates a clone and diffs the result against the seed.
Transitions
Each transition maps a tool call to an effect: over dotted state paths. The effect value is one of:
| Form | Meaning |
|---|---|
a bare scalar/object, or { set: <value> } | set the path to a literal |
{ inc: <n> } | add n to the numeric value (absent counts as zero) |
{ dec: <n> } | subtract n |
{ from_arg: "<dotted>" } | set the path to the call argument at that path |
An optional when: precondition guards the transition. Each entry maps a dotted state path to { eq: <value> }, { min: <n> }, { max: <n> }, or a bare literal (equality). A call whose guard fails is recorded as an invalid action and the effect is not applied. A call to a tool with no declared transition is also an invalid action (the agent invented an action).
Forbidden transitions
forbidden: lists tools that must never be called, each with a reason:. Forbidden rules are checked before legal transitions, so a forbidden call is recorded as a safety violation and never mutates the world. An optional when: guard scopes the rule to a state condition.
Refusals and escalations
refusal: { markers: [...] } flags a refusal when a recorded final response contains any marker (case-insensitive). escalation: { tools: [...], markers: [...] } counts an escalation for each call to an escalation tool and each final response containing an escalation marker.
Recovery attempts
A call whose immediately preceding call returned an error is counted as a recovery attempt, so a scenario can assert the agent retried after a failure.
Oracles
golden: declares the ideal calls: sequence plus acceptable alternates:. The report records an exact match, an alternate match, and the efficiency score of the recorded path. expect_state: is the primary deterministic oracle: a map of dotted path to the value the world must hold at the end. expect: runs the standard assertion grammar against the report envelope.
The report envelope
expect: assertions resolve against this object:
| Target | Meaning |
|---|---|
turns | number of turns (final responses) |
actions | number of tool calls |
invalid_actions | count of invalid actions |
forbidden_transitions | count of forbidden-transition attempts |
recovery_attempts | calls following an errored call |
escalations | escalation count |
refusals | refusal count |
state_matched | true when every expect_state path matched |
golden.matched / golden.exact / golden.alternate / golden.penalty | golden-trajectory outcome |
tool_names | the recorded call names, in order |
state.<dotted> | the final world state |
cost_micros | run cost in micro-dollars, when the model has a known rate |
Cassettes
The action trajectory comes from a recorded agent cassette (the same format mcptest record and the agent runner produce; see Cassettes). The engine reads each tool call's name, server, and arguments, the per-call error flag, the per-turn final responses, and the token totals. Point cassette: at a recording of your own agent to gate your own multi-turn, stateful tasks.
Judge and rubric evidence
A scenario may carry an optional rubric: for secondary judge evidence. The deterministic oracle (state diff, invalid actions, forbidden transitions) is always the primary signal; the rubric is evaluated only with a live provider and is recorded as deferred in the offline path.
CLI
mcptest scenario run <suite.yml> [--cassette-dir DIR] [--name NAME]... [--json]
--cassette-dirresolvescassette:paths against a directory other than the suite file's own directory.--nameruns only the named scenario (repeatable).--jsonemits the structured reports instead of the human summary.
The command exits non-zero when any scenario has an invalid action, a forbidden transition, an expected-state mismatch, or a failed expect: assertion.