Advisory LLM judge (Layer B)

The deterministic engine catches what a regex or a structural rule can see. Some threats need meaning, not pattern: a description that subtly steers the model, a docstring that does not match the tool's real behavior, a persuasive "always use this tool" pitch. Layer B runs an LLM judge over the definitions to flag those, the way Cisco's mcp-scanner runs an LLM analyzer beside its static engines.

To try it, run the example below. examples/security-tools-list.json is a captured catalog; passing --model turns on the advisory lane this page describes.

ANTHROPIC_API_KEY=... mcptest security examples/security-tools-list.json --model claude-sonnet-4-5

Advisory only

The judge never decides a security pass or fail. The deterministic engine owns the grade; the judge informs a human. That boundary is enforced in the code, not just the docs: advisory findings live in their own report type with no gating method, so a reporter cannot fold them into an authoritative verdict even by accident. Reporters label the advisory section so a reader never confuses it with the deterministic result.
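The separation can be pictured with a minimal Python sketch. The names DeterministicReport, AdvisoryReport, and render are hypothetical and chosen for illustration; they are not taken from the mcptest source. The point is only that the advisory type exposes nothing a reporter could call to obtain a verdict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeterministicReport:
    """Hypothetical stand-in for the authoritative, rule-based result."""
    failures: tuple[str, ...] = ()

    def passed(self) -> bool:
        # The only gating method in this sketch lives here.
        return not self.failures


@dataclass(frozen=True)
class AdvisoryReport:
    """Hypothetical stand-in for judge output. Deliberately has no
    passed() or gate() method, so it cannot produce a verdict."""
    findings: tuple[str, ...] = ()


def render(det: DeterministicReport, adv: AdvisoryReport) -> str:
    # The grade is computed from the deterministic report alone.
    lines = [f"Security grade: {'PASS' if det.passed() else 'FAIL'}"]
    # The advisory section is labeled so it is not read as a verdict.
    lines.append("Advisory (LLM judge, does not affect the grade):")
    lines.extend(f"  - {finding}" for finding in adv.findings)
    return "\n".join(lines)
```

In this sketch a reporter that tried to call adv.passed() would fail with an AttributeError, which is the kind of accident the separate type is meant to rule out.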

What the judge looks for

The judge looks for three semantic threat classes, the gaps a static scan leaves open:

tool-poisoning: a description that steers the model to act against the user, beyond what a pattern match would catch.
docstring-behavior-mismatch: the description claims behavior the schema does not support.
persuasive-manipulation: language pushing the model to over-prefer the tool.

The judge reads the definition itself, the observable artifact, not any narrated reasoning. That is the same evidence rule the rest of the eval path follows.
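As a rough illustration of that evidence rule, the sketch below builds a judge prompt from the definition alone. The function name build_prompt and the prompt wording are assumptions made for this example, not the tool's actual prompt; the fields name, description, and inputSchema follow the shape of an MCP tool definition.

```python
import json

# The three threat classes named above.
THREAT_CLASSES = (
    "tool-poisoning",
    "docstring-behavior-mismatch",
    "persuasive-manipulation",
)


def build_prompt(definition: dict) -> str:
    """Hypothetical prompt builder: deterministic, definition-only."""
    # Copy only the observable artifact. Anything else attached to the
    # record (for example narrated reasoning) is left out on purpose.
    artifact = {
        "name": definition.get("name"),
        "description": definition.get("description"),
        "inputSchema": definition.get("inputSchema"),
    }
    classes = "\n".join(f"- {c}" for c in THREAT_CLASSES)
    return (
        "Review this tool definition for the following threat classes:\n"
        f"{classes}\n\n"
        "Definition:\n"
        # sort_keys keeps the output stable for identical input.
        f"{json.dumps(artifact, indent=2, sort_keys=True)}\n"
    )
```

Because the builder is a pure function of the definition, the same catalog always yields the same prompt, which makes it straightforward to test without a model.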

Confidence and escalation

Each finding carries a confidence band (high, medium, or low), reusing the calibration from the jury work. A low-confidence finding sets an escalation flag: it goes to a human for review rather than being acted on directly. This is how the advisory lane stays useful without overclaiming. A confident flag is worth reading; a coin-flip flag is worth a second pair of eyes.
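A minimal sketch of the band-to-escalation rule follows. The type and field names (Confidence, AdvisoryFinding, escalate, for_human_review) are hypothetical, and the sketch assumes that only the low band triggers escalation, as the paragraph above describes.

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class AdvisoryFinding:
    """Hypothetical shape of one judge finding."""
    tool: str
    threat_class: str
    confidence: Confidence

    @property
    def escalate(self) -> bool:
        # Only a low-confidence finding sets the escalation flag.
        return self.confidence is Confidence.LOW


def for_human_review(findings: list[AdvisoryFinding]) -> list[AdvisoryFinding]:
    """Return the findings that should go to a reviewer first."""
    return [f for f in findings if f.escalate]
```

Deriving the flag from the band, rather than storing it separately, keeps the two from drifting apart in this sketch.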

Degrades safely

The live judge is injected as a closure, so prompt building and response parsing stay deterministic and fully tested, and a run can replay from a cassette without a live model. When the judge call fails, the run records a judge error and produces no finding for that definition. A judge outage shrinks the advisory output; it never changes the grade, because the grade was never the judge's to move.
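The injection and failure handling might look like the sketch below. Every name here (Judge, replay_judge, build_prompt, run_advisory) is hypothetical, the cassette is modeled as a plain dict from prompt to recorded response, and the judge is assumed to return a JSON list of finding objects. Treating an unparseable response the same way as a failed call is also an assumption of this sketch.

```python
import json
from typing import Callable

# The judge is any callable from prompt text to raw response text.
Judge = Callable[[str], str]


def build_prompt(definition: dict) -> str:
    # Deterministic: identical definitions give identical prompts.
    return "Review this tool definition:\n" + json.dumps(definition, sort_keys=True)


def replay_judge(cassette: dict[str, str]) -> Judge:
    """Build a judge closure that replays recorded responses."""
    def judge(prompt: str) -> str:
        # Raises KeyError when no response was recorded for the prompt.
        return cassette[prompt]
    return judge


def run_advisory(definitions: list[dict], judge: Judge):
    findings, errors = [], []
    for definition in definitions:
        name = definition.get("name", "<unnamed>")
        try:
            parsed = json.loads(judge(build_prompt(definition)))
            new = [{"tool": name, **item} for item in parsed]
        except Exception as exc:
            # Record the judge error and emit no finding for this tool.
            errors.append({"tool": name, "error": repr(exc)})
            continue
        findings.extend(new)
    # Nothing returned here feeds the deterministic grade.
    return findings, errors
```

A live judge would be another closure with the same signature that wraps a model call, so the surrounding loop does not need to know which one it was given.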