
Policy simulator

mcptest policy simulate is a local governance gate. It reads a small declarative policy file plus whichever mcptest artifacts you already have on disk, extracts named facts from each artifact, evaluates the policy rules against those facts, applies any waivers, and prints a pass / warn / fail verdict with a cited line per rule. It runs entirely offline: no network, no hosted collector, no live server. The same artifacts your pipeline already produces become the inputs to a release gate you can read in one sitting.

This is the open-source, single-developer half of governance. It does not store results, manage approvals across a team, or enforce anything centrally; it evaluates a policy you commit next to your tests and returns an exit code your CI can block on.

The command

mcptest policy simulate --policy policy.yml \
  --run-report run.json \
  --judge-cert cert.json \
  --conformance-report conformance.json \
  --security security.json \
  --model-compat model-compat.json \
  --evidence evidence.json

Only --policy is required. Pass whichever artifact flags you have; each one adds its facts to the pool the rules evaluate against. A rule whose fact has no backing artifact is reported as unevaluated and fails the gate, so a missing input never silently passes (more on that below).
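As a rough illustration of "pass whichever artifact flags you have", the following Python sketch builds the command line from the files that actually exist in a directory. The flag names come from the command above; the helper name `build_command`, the `ARTIFACT_FLAGS` mapping, and the file names are assumptions made for this example, not part of mcptest.

```python
from pathlib import Path

# Flag names are taken from the command above. The file names are whatever
# your pipeline writes, so this mapping is only an example.
ARTIFACT_FLAGS = {
    "--run-report": "run.json",
    "--judge-cert": "cert.json",
    "--conformance-report": "conformance.json",
    "--security": "security.json",
    "--model-compat": "model-compat.json",
    "--evidence": "evidence.json",
}


def build_command(policy: str, artifact_dir: str) -> list[str]:
    """Return argv for `mcptest policy simulate`, passing only artifacts that exist."""
    argv = ["mcptest", "policy", "simulate", "--policy", policy]
    for flag, filename in ARTIFACT_FLAGS.items():
        path = Path(artifact_dir) / filename
        if path.is_file():
            # Only add a flag when its artifact is really on disk.
            argv += [flag, str(path)]
    return argv
```

Skipping a missing file this way does not weaken the gate: any rule that depends on the skipped artifact is still reported as unevaluated and fails.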

| Flag | Artifact | Produced by |
| --- | --- | --- |
| --policy | The policy file itself (YAML) | You author it. See examples/policy/policy.yml. |
| --run-report | Run report (JSON) | mcptest run --reporter json --output run.json |
| --judge-cert | Judge certification record (JSON) | mcptest judge certify --output cert.json |
| --conformance-report | Conformance score (JSON) | mcptest conformance run --format json |
| --security | Security scan (JSON) | mcptest security tools-list.json --format json |
| --model-compat | Model-compatibility diff (JSON) | mcptest model-compat diff --format json |
| --evidence | Evidence artifact (JSON) | mcptest evidence run.json --out evidence.json |
| --gate | (no artifact) | Turns a failing verdict into a non-zero exit code. |
| --format | (no artifact) | pretty (default) or json. |

The policy file

A policy is a tiny YAML document: a version string, a list of rules, and an optional list of waivers. There is no expression language on purpose. Each rule names exactly one fact and one comparator, so anyone reading the file can see at a glance what gates the release.

version: "1.0"
rules:
  - id: no-failed-tests
    description: Every test in the run must pass.
    fact: run.failed
    max: 0
  - id: judge-certified
    description: The grading judge must be certified.
    fact: judge.certified
    equals: true
  - id: conformance-tier
    description: The server must reach conformance tier 1 or 2.
    fact: conformance.badge
    one_of: [T1, T2]
    severity: warn
waivers:
  - rule: conformance-tier
    owner: platform-team
    reason: A known SHOULD gap tracked upstream.
    expiry: "2099-01-01T00:00:00Z"
    issue: GH-1234

The full worked example lives at examples/policy/policy.yml.

Rules

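Each rule has an id, a description, the fact it evaluates, exactly one comparator, and an optional severity. The example above uses the comparators max, equals, and one_of, and marks one rule with severity: warn.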

Setting zero or more than one comparator on a rule makes it unevaluated, which fails the gate, so a malformed rule is loud rather than silent.
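The one-fact, one-comparator model can be sketched in a few lines of Python. This is an illustration of the semantics described here, not mcptest's implementation: it covers only the three comparators that appear in the example policy, the function name `evaluate_rule` is hypothetical, and it assumes that a failing rule marked severity: warn reports warn rather than fail.

```python
# Only the three comparators seen in the example policy; others may exist.
COMPARATORS = ("max", "equals", "one_of")


def evaluate_rule(rule: dict, facts: dict) -> str:
    """Return 'pass', 'fail', 'warn', or 'unevaluated' for one rule (before waivers)."""
    present = [name for name in COMPARATORS if name in rule]
    # Zero comparators, several comparators, or a fact with no backing
    # artifact all make the rule unevaluated, which fails the gate.
    if len(present) != 1 or rule.get("fact") not in facts:
        return "unevaluated"

    name = present[0]
    value = facts[rule["fact"]]
    expected = rule[name]
    if name == "max":
        ok = value <= expected
    elif name == "equals":
        ok = value == expected
    else:  # one_of
        ok = value in expected

    if ok:
        return "pass"
    # Assumption: severity "warn" downgrades a failure to a warning.
    return "warn" if rule.get("severity") == "warn" else "fail"
```

For example, with facts `{"run.failed": 0}`, the no-failed-tests rule passes, while a rule on `security.high_count` is unevaluated because no security artifact supplied that fact.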

Waivers

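A waiver suppresses one rule's failure until it expires. In the example above, the waiver names the rule it covers, an owner, a reason, an expiry timestamp, and a tracking issue.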

Waivers fail closed. While a waiver is active, a failing rule is reported as waived and does not fail the gate. Once the expiry passes (or if the expiry does not parse), the waiver no longer suppresses anything: the rule is reported as expired-waiver and the gate fails. A waiver is a dated promise to fix something, not a permanent exception, and the simulator enforces the date.
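The fail-closed behavior might look like the Python sketch below. It is a minimal illustration under stated assumptions: the function name `apply_waiver` is hypothetical, waivers are assumed to affect only rules that did not pass, and expiry timestamps without a timezone are assumed to be UTC.

```python
from datetime import datetime, timezone
from typing import Optional


def apply_waiver(outcome: str, waiver: Optional[dict], now: datetime) -> str:
    """Adjust one rule's outcome for an optional waiver, failing closed."""
    # Assumption: a waiver only matters when the rule did not pass.
    if waiver is None or outcome not in ("fail", "warn"):
        return outcome
    try:
        # fromisoformat on older Pythons rejects a trailing "Z", so normalize it.
        expiry = datetime.fromisoformat(str(waiver["expiry"]).replace("Z", "+00:00"))
    except (KeyError, ValueError):
        # A missing or unparseable expiry never suppresses anything.
        return "expired-waiver"
    if expiry.tzinfo is None:
        expiry = expiry.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return "waived" if now < expiry else "expired-waiver"
```

Call it with a timezone-aware `now`, for example `datetime.now(timezone.utc)`. The design point is that every error path lands on expired-waiver, so a typo in the date can only make the gate stricter.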

The fact catalog

Each artifact flag contributes a fixed set of facts. A fact that an artifact does not carry is simply absent (and any rule referencing it is unevaluated), except the five security severity counts, which are always emitted (zero when a clean scan had no findings of that severity) so a max: 0 rule on them works even against a clean scan.

| Fact | Type | Source artifact |
| --- | --- | --- |
| run.total | number | --run-report |
| run.passed | number | --run-report |
| run.failed | number | --run-report |
| run.skipped | number | --run-report |
| run.inconclusive | number | --run-report (when present) |
| judge.certified | boolean | --judge-cert |
| judge.ece | number | --judge-cert (expected calibration error) |
| judge.brier | number | --judge-cert (Brier score) |
| judge.expired | boolean | --judge-cert (computed from the certification's validity window versus the current time) |
| conformance.badge | text | --conformance-report (T1 / T2 / T3 / F) |
| conformance.must_passed | number | --conformance-report (MUST checks passed) |
| conformance.must_total | number | --conformance-report (MUST checks total) |
| conformance.should_passed | number | --conformance-report (SHOULD checks passed) |
| conformance.should_total | number | --conformance-report (SHOULD checks total) |
| conformance.tier | text | --conformance-report |
| security.critical_count | number | --security |
| security.high_count | number | --security |
| security.medium_count | number | --security |
| security.low_count | number | --security |
| security.info_count | number | --security |
| security.total_findings | number | --security |
| model_compat.total | number | --model-compat |
| model_compat.pass | number | --model-compat |
| model_compat.drift | number | --model-compat |
| model_compat.fail | number | --model-compat |
| evidence.reproducible | boolean | --evidence |
| evidence.unverifiable_origin | boolean | --evidence |
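To see why the security severity counts are always emitted, consider this Python sketch of fact extraction. The scan layout is an assumption made for illustration (a "findings" list whose items carry a "severity" string); the real security artifact may be shaped differently, and `security_facts` is a hypothetical helper name.

```python
from collections import Counter

SEVERITIES = ("critical", "high", "medium", "low", "info")


def security_facts(scan: dict) -> dict:
    """Derive security.* facts, emitting a zero for severities with no findings."""
    # Assumption: findings live under "findings", each with a "severity" string.
    findings = scan.get("findings", [])
    counts = Counter(finding.get("severity") for finding in findings)
    # Every severity gets a key, so `max: 0` rules evaluate on a clean scan.
    facts = {f"security.{level}_count": counts.get(level, 0) for level in SEVERITIES}
    facts["security.total_findings"] = len(findings)
    return facts
```

If the zero counts were omitted instead, a clean scan would leave a rule such as `fact: security.high_count, max: 0` unevaluated, and the cleanest possible result would fail the gate.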

Verdict and exit codes

Every rule resolves to one of: pass, fail, warn, waived, expired-waiver, or unevaluated. The overall verdict is the worst outcome across all rules:

waived and pass rules never fail the gate.
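One way to read "worst outcome across all rules" is the Python sketch below. The grouping is inferred from this page rather than taken from mcptest's source: fail, unevaluated, and expired-waiver are treated as failing, warn sits in the middle, and pass and waived are harmless. The name `overall_verdict` is hypothetical.

```python
# Outcomes this page describes as failing the gate.
FAILING = {"fail", "unevaluated", "expired-waiver"}


def overall_verdict(outcomes: list[str]) -> str:
    """Collapse per-rule outcomes into a single pass / warn / fail verdict."""
    if any(outcome in FAILING for outcome in outcomes):
        return "fail"
    if "warn" in outcomes:
        return "warn"
    # Only 'pass' and 'waived' remain, and neither fails the gate.
    return "pass"
```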

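Exit codes follow the verdict only when you ask for a gate: with --gate, a failing verdict becomes a non-zero exit code; without it, the verdict is still printed but does not drive the exit code.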
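The relationship between the verdict and the exit code can be sketched as follows. The value 1 stands in for "non-zero" because this page does not state the exact code, the warn verdict is assumed to exit zero, and `exit_code` is a hypothetical name.

```python
def exit_code(verdict: str, gate: bool) -> int:
    """Map a verdict to a process exit code, honoring the --gate flag."""
    # Assumption: 1 represents "non-zero"; the real value is not documented here.
    # Assumption: a 'warn' verdict does not block, even with --gate.
    return 1 if gate and verdict == "fail" else 0
```

Without the gate, a failing verdict still exits zero in this sketch, which is what lets you try a new policy in report-only mode before CI starts blocking on it.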

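Two behaviors are worth restating because they keep the gate honest: a rule whose fact has no backing artifact is unevaluated and fails the gate, so a missing input never silently passes; and a waiver whose expiry has passed or does not parse stops suppressing its rule, so a waiver cannot quietly become a permanent exception.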

How it relates to evidence and judge certification

The simulator sits one layer above the artifacts. The evidence artifact (mcptest evidence) bundles a run's metadata into one portable, signable file; the policy simulator can read that file and gate on its reproducible and unverifiable_origin flags alongside everything else. Judge certification (mcptest judge certify) proves a grading judge is calibrated before its verdict may gate; the simulator turns that proof into a gate condition with the judge.certified and judge.expired facts, so a release can require that the judge behind its evals was certified and that the certification has not gone stale. In short, the other commands produce trustworthy artifacts; the policy simulator decides, locally and reproducibly, whether those artifacts clear the bar you set.

See examples/policy/policy.yml for a starting point you can copy and trim.