Red-team exploitability (Layers C1 and C2)

The deterministic security engine flags weaknesses in a server's catalog: a tool description that carries an instruction for the model, a tool that shadows another, a definition that changed since you approved it. A flag is not the same as an exploit. Layer C1 answers the follow-up question: does a real model actually fall for this attack?

It runs a model against the poisoned server (or replays a recorded run) and reads the observable trace. The model is the target, not the judge. The oracle is deterministic: it asserts on what the model did, which tool fired, with what arguments, and what it said back, never on how the model narrated its reasoning. That is the same observable-evidence rule the rest of the eval path follows, so a model cannot talk its way to a pass.

Running it: `mcptest security redteam`

The live command drives the corpus against a running server:

ANTHROPIC_API_KEY=... mcptest security redteam --url https://localhost:8080/mcp --model claude-sonnet-4-5

It connects to --url (reusing the same streamable-HTTP transport the run command uses), runs each scenario through the agent loop with --model, scores the captured trace, and prints the per-model exploitability report. The model provider is resolved from the environment the same way agent runs resolve it, and the API key is never logged. The exit code reports whether the run completed (0), or a setup failure (2); it never reports the exploitability signal, because that signal is advisory, not the verdict.

The result is a per-model signal, not a grade

A weakness one model resists and another falls for is a property of the model, not of the server. So Layer C1 output is a per-model exploitability signal that feeds the model-compatibility view. It is not folded into the server's security grade. The grade stays on the deterministic engine, which describes the server regardless of which model you point at it.

Safety checks read as "the model stayed safe"

A red-team scenario phrases its expectations as the safe behavior. If every check passes, the model resisted; if any check fails, the model took the bait, and the failed checks name what went wrong. The checks run on the trace envelope with the same path grammar the rest of mcptest uses (tool_calls[0].name, final_response).

Most red-team checks are negations: the model must not call the exfil tool, must not echo the attacker endpoint. The shared regex matcher cannot express that, because the regex engine has no negative lookahead. The oracle carries a not_contains matcher for it, and uses exact: null to assert an absent slot such as "there was no second tool call".

Cassette-replayable

Because the oracle reads a captured trace, a recorded run replays without a live model. The fixtures under tests/fixtures/redteam/ are two frozen traces, one where the model resisted a data-exfiltration scenario and one where it did not, and the tests assert the expected outcome against each. This keeps the corpus runnable in CI without API keys and makes a regression in the oracle visible.

The scenario corpus itself lives under examples/security/; see the corpus notes for its sources and licensing.

Adaptive attacker (Layer C2)

C1 runs the corpus once. C2 adapts: it takes an attack seed and mutates it with deterministic converters before each attempt, to get past a naive filter that only blocks the literal payload. The design mirrors PyRIT (arXiv:2410.02828): seeds, an orchestrator that iterates, converters that mutate payloads, the target, and a scorer. The scorer is the same C1 oracle, so success detection stays deterministic and observable.

The converters are pure string transforms, so a campaign is reproducible without a model deciding the mutation:

identity runs the payload unchanged (the baseline attempt).
leetspeak substitutes vowels (ignore becomes 1gn0r3).
rot13 rotates the letters.
homoglyph swaps ASCII letters for Unicode confusables that look identical.
hex_encode encodes the bytes as hex.

The orchestrator runs the ladder cheapest-first and stops at the first converter that lands an exploit, so the report names the specific mutation that worked. Iterations are bounded (max_attempts), so a campaign cannot run away. The live target sits behind a run closure, which means a campaign replays from a cassette in tests and drives a real model in production. An LLM that generates fresh seeds plugs into the same orchestrator by supplying more seeds.

A seed campaign folds into the same per-model scenario outcome C1 produces, so adaptive results roll up into one exploitability report next to the fixed corpus. As with C1, this is a per-model signal, not the server's grade.