Policy simulator

mcptest policy simulate is a local governance gate. It reads a small declarative policy file plus whichever mcptest artifacts you already have on disk, extracts named facts from each artifact, evaluates the policy rules against those facts, applies any waivers, and prints a pass / warn / fail verdict with a cited line per rule. It runs entirely offline: no network, no hosted collector, no live server. The same artifacts your pipeline already produces become the inputs to a release gate you can read in one sitting.

This is the open-source, single-developer half of governance. It does not store results, manage approvals across a team, or enforce anything centrally; it evaluates a policy you commit next to your tests and returns an exit code your CI can block on.

The command

mcptest policy simulate --policy policy.yml \
  --run-report run.json \
  --judge-cert cert.json \
  --conformance-report conformance.json \
  --security security.json \
  --model-compat model-compat.json \
  --evidence evidence.json

Only --policy is required. Pass whichever artifact flags you have; each one adds its facts to the pool the rules evaluate against. A rule whose fact has no backing artifact is reported as unevaluated and fails the gate, so a missing input never silently passes (more on that below).

Flag	Artifact	Produced by
`--policy`	The policy file itself (YAML)	You author it. See `examples/policy/policy.yml`.
`--run-report`	Run report (JSON)	`mcptest run --reporter json --output run.json`
`--judge-cert`	Judge certification record (JSON)	`mcptest judge certify --output cert.json`
`--conformance-report`	Conformance score (JSON)	`mcptest conformance run --format json`
`--security`	Security scan (JSON)	`mcptest security tools-list.json --format json`
`--model-compat`	Model-compatibility diff (JSON)	`mcptest model-compat diff --format json`
`--evidence`	Evidence artifact (JSON)	`mcptest evidence run.json --out evidence.json`
`--gate`	(no artifact)	Turns a failing verdict into a non-zero exit code.
`--format`	(no artifact)	`pretty` (default) or `json`.

The policy file

A policy is a tiny YAML document: a version string, a list of rules, and an optional list of waivers. There is no expression language on purpose. Each rule names exactly one fact and one comparator, so anyone reading the file can see at a glance what gates the release.

version: "1.0"
rules:
  - id: no-failed-tests
    description: Every test in the run must pass.
    fact: run.failed
    max: 0
  - id: judge-certified
    description: The grading judge must be certified.
    fact: judge.certified
    equals: true
  - id: conformance-tier
    description: The server must reach conformance tier 1 or 2.
    fact: conformance.badge
    one_of: [T1, T2]
    severity: warn
waivers:
  - rule: conformance-tier
    owner: platform-team
    reason: A known SHOULD gap tracked upstream.
    expiry: "2099-01-01T00:00:00Z"
    issue: GH-1234

The full worked example lives at examples/policy/policy.yml.

Rules

Each rule has:

id: a stable identifier, cited in the report and matched by waivers.
description (optional): a human note shown when you author the rule.
fact: the fact name the rule constrains (see the catalog below).
exactly one comparator:
- max: the fact (a number) must be less than or equal to this value.
- min: the fact (a number) must be greater than or equal to this value.
- equals: the fact must equal this literal. The comparison follows the literal's type: equals: true compares a boolean fact, equals: 0 a numeric fact, equals: "T1" a textual fact.
- one_of: the fact (rendered to text) must be one of the listed values.
severity (optional, defaults to fail): fail means a failing rule fails the gate; warn means a failing rule only warns and never fails the gate.

Setting zero or more than one comparator on a rule makes it unevaluated, which fails the gate, so a malformed rule is loud rather than silent.
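
The rule semantics above can be sketched in a few lines of Python. This is an illustration of the documented behavior, not the tool's implementation: the names evaluate_rule, matches, is_number, and as_text are hypothetical. Two details are assumptions, because the text above does not specify them: booleans are rendered as true / false for one_of, and a type mismatch (for example max against a text fact) counts as a failed comparison.

```python
from typing import Any

COMPARATORS = ("max", "min", "equals", "one_of")


def is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def as_text(value: Any) -> str:
    # Assumption: booleans render as "true"/"false", as they appear in YAML.
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def matches(comparator: str, expected: Any, actual: Any) -> bool:
    """Apply one comparator; a type mismatch is treated as no match (assumption)."""
    if comparator == "max":
        return is_number(actual) and is_number(expected) and actual <= expected
    if comparator == "min":
        return is_number(actual) and is_number(expected) and actual >= expected
    if comparator == "equals":
        # The comparison follows the type of the literal in the policy.
        if isinstance(expected, bool):
            return isinstance(actual, bool) and actual == expected
        if is_number(expected):
            return is_number(actual) and actual == expected
        return isinstance(actual, str) and actual == expected
    # one_of: compare the fact's text rendering against each listed value.
    return as_text(actual) in [as_text(v) for v in expected]


def evaluate_rule(rule: dict, facts: dict) -> str:
    """Return 'pass', 'fail', 'warn', or 'unevaluated' for a single rule."""
    present = [c for c in COMPARATORS if c in rule]
    if len(present) != 1:
        return "unevaluated"  # zero or several comparators: malformed rule
    if rule.get("fact") not in facts:
        return "unevaluated"  # no artifact supplied this fact
    comparator = present[0]
    if matches(comparator, rule[comparator], facts[rule["fact"]]):
        return "pass"
    return "warn" if rule.get("severity", "fail") == "warn" else "fail"
```

Waivers are deliberately left out of this sketch; it only covers how a single rule resolves before any waiver is considered.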

Waivers

A waiver suppresses one rule's failure until it expires:

rule: the rule id it suppresses.
owner: who owns the waiver, so reviewers know whom to ask.
reason: why the failure is tolerated, captured for the audit trail.
expiry: an RFC 3339 UTC timestamp, for example 2026-12-31T00:00:00Z.
issue (optional): a tracking reference such as a GitHub issue id.

Waivers fail closed. While a waiver is active, a failing rule is reported as waived and does not fail the gate. Once the expiry passes (or if the expiry does not parse), the waiver no longer suppresses anything: the rule is reported as expired-waiver and the gate fails. A waiver is a dated promise to fix something, not a permanent exception, and the simulator enforces the date.
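
The fail-closed behavior can be sketched as follows. This is a minimal Python illustration, not the tool's code: parse_expiry and apply_waiver are hypothetical names, the parser accepts only the whole-second UTC form shown above, and the sketch assumes a waiver only changes the outcome of a rule that resolved to fail.

```python
from datetime import datetime, timezone


def parse_expiry(text):
    """Parse a timestamp like 2026-12-31T00:00:00Z; return None if it does not parse."""
    try:
        parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    except (TypeError, ValueError):
        return None
    return parsed.replace(tzinfo=timezone.utc)


def apply_waiver(outcome: str, waiver, now: datetime) -> str:
    """Turn a failing outcome into 'waived' or 'expired-waiver' when a waiver exists."""
    # Assumption: only 'fail' outcomes are affected by a waiver.
    if outcome != "fail" or waiver is None:
        return outcome
    expiry = parse_expiry(waiver.get("expiry"))
    if expiry is not None and now < expiry:
        return "waived"  # active waiver: the failure is suppressed
    # Expired or unparseable expiry: fail closed.
    return "expired-waiver"
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps the check reproducible when you test it.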

The fact catalog

Each artifact flag contributes a fixed set of facts. A fact that an artifact does not carry is simply absent, and any rule referencing it is unevaluated. The one exception is the five security severity counts (critical, high, medium, low, info): they are always emitted, as zero when the scan has no findings of that severity, so a max: 0 rule on them works even against a clean scan.

Fact	Type	Source artifact
`run.total`	number	`--run-report`
`run.passed`	number	`--run-report`
`run.failed`	number	`--run-report`
`run.skipped`	number	`--run-report`
`run.inconclusive`	number	`--run-report` (when present)
`judge.certified`	boolean	`--judge-cert`
`judge.ece`	number	`--judge-cert` (expected calibration error)
`judge.brier`	number	`--judge-cert` (Brier score)
`judge.expired`	boolean	`--judge-cert` (computed from the certification's validity window versus the current time)
`conformance.badge`	text	`--conformance-report` (`T1` / `T2` / `T3` / `F`)
`conformance.must_passed`	number	`--conformance-report` (MUST checks passed)
`conformance.must_total`	number	`--conformance-report` (MUST checks total)
`conformance.should_passed`	number	`--conformance-report` (SHOULD checks passed)
`conformance.should_total`	number	`--conformance-report` (SHOULD checks total)
`conformance.tier`	text	`--conformance-report`
`security.critical_count`	number	`--security`
`security.high_count`	number	`--security`
`security.medium_count`	number	`--security`
`security.low_count`	number	`--security`
`security.info_count`	number	`--security`
`security.total_findings`	number	`--security`
`model_compat.total`	number	`--model-compat`
`model_compat.pass`	number	`--model-compat`
`model_compat.drift`	number	`--model-compat`
`model_compat.fail`	number	`--model-compat`
`evidence.reproducible`	boolean	`--evidence`
`evidence.unverifiable_origin`	boolean	`--evidence`
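
To make the difference between an absent fact and an always-emitted zero concrete, here is a small Python sketch. It is an illustration under assumptions: security_facts is a hypothetical helper, and it takes a plain list of severity strings instead of assuming anything about the real layout of the security scan JSON.

```python
from collections import Counter

SEVERITIES = ("critical", "high", "medium", "low", "info")


def security_facts(finding_severities):
    """Build the five severity-count facts from a list of severity names."""
    counts = Counter(finding_severities)
    # Every severity gets a fact, even when its count is zero.
    return {f"security.{name}_count": counts.get(name, 0) for name in SEVERITIES}


clean = security_facts([])
# A clean scan still carries security.high_count (as 0), so a max: 0 rule
# on it can be evaluated. A fact from an artifact that was never passed,
# such as run.failed here, is simply not in the pool.
print(clean["security.high_count"], "run.failed" in clean)
```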

Verdict and exit codes

Every rule resolves to one of: pass, fail, warn, waived, expired-waiver, or unevaluated. The overall verdict is the worst outcome across all rules:

fail if any rule is fail, expired-waiver, or unevaluated.
otherwise warn if any rule is warn.
otherwise pass.

Rules that resolve to pass or waived never fail the gate.
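
The aggregation described in this section fits in a few lines. A minimal Python sketch, with overall_verdict as a hypothetical name; returning pass for an empty rule list is an assumption, since the text does not say what an empty policy yields.

```python
FAILING = {"fail", "expired-waiver", "unevaluated"}


def overall_verdict(outcomes) -> str:
    """Reduce per-rule outcomes to the overall pass / warn / fail verdict."""
    outcomes = list(outcomes)
    if any(outcome in FAILING for outcome in outcomes):
        return "fail"
    if "warn" in outcomes:
        return "warn"
    # Only 'pass' and 'waived' remain (or no rules at all, by assumption).
    return "pass"
```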

Exit codes follow the verdict only when you ask for a gate:

Without --gate the command is a dry run: it always exits 0 and just prints the verdict. This is useful for showing the report without blocking.
With --gate the command exits 1 when the verdict is fail, and 0 otherwise. This is what you wire into CI.
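
The mapping from verdict to exit code is small enough to state as a function. This Python sketch only restates the two cases above; exit_code is a hypothetical name, not part of the tool.

```python
def exit_code(verdict: str, gate: bool) -> int:
    """Exit 1 only when gating is requested and the verdict is 'fail'."""
    return 1 if gate and verdict == "fail" else 0


# Print the full truth table: three verdicts, with and without --gate.
for gate in (False, True):
    for verdict in ("pass", "warn", "fail"):
        print(f"gate={gate} verdict={verdict} exit={exit_code(verdict, gate)}")
```

Note that a warn verdict exits 0 even with --gate, so warnings stay visible in the report without blocking CI.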

Two behaviors are worth restating because they keep the gate honest:

An expired waiver fails closed. A stale promise to fix something stops suppressing the failure the moment it expires.
A missing fact is unevaluated and fails. If a rule references a fact and you did not pass the artifact that provides it, the gate fails rather than pretending the check passed.

How it relates to evidence and judge certification

The simulator sits one layer above the artifacts. The evidence artifact (mcptest evidence) bundles a run's metadata into one portable, signable file; the policy simulator can read that file and gate on its reproducible and unverifiable_origin flags alongside everything else. Judge certification (mcptest judge certify) proves that a grading judge is calibrated before its verdicts are allowed to gate a release; the simulator turns that proof into gate conditions through the judge.certified and judge.expired facts, so a release can require both that the judge behind its evals was certified and that the certification has not gone stale. In short, the other commands produce trustworthy artifacts; the policy simulator decides, locally and reproducibly, whether those artifacts clear the bar you set.

See examples/policy/policy.yml for a starting point you can copy and trim.