Security vulnerability report

mcptest security runs the deterministic red-team catalog over a server's tool/prompt/resource surface and reports findings. In addition to terminal output (--format pretty), machine-readable JSON (--format json), and SARIF for code scanning (--format sarif), two formats produce a vulnerability report you can hand to a reviewer:

--format html: a self-contained HTML report, one file, no assets. Findings are grouped by severity (critical first), each shown as a card with its rule id, the definition it is about, the message, the concrete evidence that fired it, the OWASP LLM Top 10 category it maps to, and a one-line remediation. An OWASP coverage table follows.
--format md: the same content as GitHub-flavored Markdown, for PR comments and job summaries.

mcptest security --snapshot tools.json --format html > security.html
mcptest security --snapshot tools.json --format md   >> "$GITHUB_STEP_SUMMARY"
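
The severity grouping described above can be sketched in a few lines of Python. This is an illustration only, not mcptest's implementation: the finding fields (rule_id, severity, definition), the severity names other than "critical", and the sample severities are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical severity scale; only "critical first" comes from the text above.
SEVERITY_ORDER = ["critical", "high", "medium", "low", "info"]


def group_by_severity(findings):
    """Return (severity, findings) pairs, most severe group first."""
    buckets = defaultdict(list)
    for finding in findings:
        buckets[finding["severity"]].append(finding)
    rank = {name: i for i, name in enumerate(SEVERITY_ORDER)}
    # Severities outside the assumed scale sort after the known ones.
    ordered = sorted(buckets, key=lambda s: rank.get(s, len(SEVERITY_ORDER)))
    return [(severity, buckets[severity]) for severity in ordered]


# Made-up findings: the severities and definition names are illustrative only.
findings = [
    {"rule_id": "SEC-009", "severity": "high", "definition": "delete_file"},
    {"rule_id": "SEC-001", "severity": "critical", "definition": "search"},
    {"rule_id": "SEC-036", "severity": "medium", "definition": "list_items"},
]

grouped = group_by_severity(findings)
for severity, items in grouped:
    print(severity, [f["rule_id"] for f in items])
```

Each (severity, findings) pair would correspond to one group of cards in the HTML report or one section in the Markdown report.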

OWASP LLM Top 10 coverage

Each check in the catalog declares external references, including OWASP LLM Top 10 identifiers (for example OWASP LLM01). The report builds a coverage table from those references: one row per OWASP category, showing whether the catalog addresses it, which SEC-NNN rules cover it, and how many findings fired in that category on this run.
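
As a rough sketch of how such a table could be derived, the Python below maps each rule's references to OWASP categories and counts findings per category. The data shapes (a dict of rule id to reference strings, findings as dicts with a rule_id key) and the function name build_coverage are assumptions for illustration; the rule-to-category pairs are taken from the scope table later in this document, and the findings are made up.

```python
OWASP_CATEGORIES = [f"LLM{n:02d}" for n in range(1, 11)]  # LLM01 .. LLM10


def build_coverage(catalog, findings):
    """Return one row per OWASP category: covering rules and finding count."""
    rows = {cat: {"rules": [], "findings": 0} for cat in OWASP_CATEGORIES}
    rule_to_cats = {}
    for rule_id, refs in catalog.items():
        for ref in refs:
            # Assumed reference format: "OWASP LLM01". Other references are skipped.
            parts = ref.split()
            if len(parts) == 2 and parts[0] == "OWASP" and parts[1] in rows:
                rows[parts[1]]["rules"].append(rule_id)
                rule_to_cats.setdefault(rule_id, []).append(parts[1])
    for finding in findings:
        for cat in rule_to_cats.get(finding["rule_id"], []):
            rows[cat]["findings"] += 1
    return rows


# Rule-to-category pairs as listed in the scope table; the dict shape is hypothetical.
catalog = {
    "SEC-001": ["OWASP LLM01"],
    "SEC-003": ["OWASP LLM02"],
    "SEC-008": ["OWASP LLM02"],
    "SEC-009": ["OWASP LLM06"],
    "SEC-036": ["OWASP LLM10"],
    "SEC-037": ["OWASP LLM07"],
}
# Made-up findings for the example run.
findings = [{"rule_id": "SEC-001"}, {"rule_id": "SEC-001"}, {"rule_id": "SEC-008"}]

rows = build_coverage(catalog, findings)
for cat, row in rows.items():
    covered = "yes" if row["rules"] else "no"
    print(cat, covered, ",".join(row["rules"]) or "-", row["findings"])
```

A row with an empty rule list is printed as not covered, which is the signal the report uses to show where the catalog is thin.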

The OWASP LLM Top 10 (2025) categories are: LLM01 Prompt Injection, LLM02 Sensitive Information Disclosure, LLM03 Supply Chain, LLM04 Data and Model Poisoning, LLM05 Improper Output Handling, LLM06 Excessive Agency, LLM07 System Prompt Leakage, LLM08 Vector and Embedding Weaknesses, LLM09 Misinformation, and LLM10 Unbounded Consumption.

The coverage table is built from the bundled per-definition catalog (the surface lane), the same catalog that the SARIF rule definitions come from. The relational lanes (namespace, integrity, toxic-flow, trust-propagation) also contribute findings to the report; their findings render without an OWASP tag when the firing rule is not in the per-definition catalog.
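
The tagging behavior for relational-lane findings amounts to a lookup that may come back empty. A minimal Python sketch, assuming a hypothetical rule-to-category mapping and a placeholder rule id (REL-EXAMPLE) that is not a real mcptest identifier:

```python
# Hypothetical mapping derived from the per-definition catalog.
RULE_TO_OWASP = {"SEC-001": ["LLM01"], "SEC-008": ["LLM02"]}


def render_finding(finding, rule_to_owasp):
    """Render one finding line, adding an OWASP tag only when the rule is cataloged."""
    tags = rule_to_owasp.get(finding["rule_id"], [])
    suffix = f" [{', '.join(tags)}]" if tags else ""
    return f"{finding['rule_id']}: {finding['message']}{suffix}"


surface = {"rule_id": "SEC-001", "message": "imperative text in description"}
# "REL-EXAMPLE" stands in for a relational-lane rule; the id is invented.
relational = {"rule_id": "REL-EXAMPLE", "message": "source-to-sink pairing"}

print(render_finding(surface, RULE_TO_OWASP))
print(render_finding(relational, RULE_TO_OWASP))
```

The second finding still appears in the output; it simply carries no category tag and therefore does not count toward any row of the coverage table.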

A category with no covering rule is a coverage gap worth a new probe, unless it is deliberately out of scope (see OWASP coverage scope below). Read the table to see where the catalog is thin before trusting a clean run.

OWASP coverage scope

Not every OWASP category is addressable as a deterministic predicate over a static tool, prompt, or resource definition. The table below records which lane covers each category and, where a category is out of scope for a black-box definition check, why.

OWASP category	Covered by	Notes
LLM01 Prompt Injection	surface (SEC-001)	Imperative model-directed text in a description.
LLM02 Sensitive Information Disclosure	surface (SEC-003, SEC-008)	Exfiltration directives and embedded secrets.
LLM03 Supply Chain	integrity lane	Rug-pull and schema-drift checks compare a current catalog against an approved baseline. Not a single-definition check, so it lives outside the per-definition surface catalog.
LLM04 Data and Model Poisoning	out of scope (runtime)	Poisoning is a runtime data-flow concern: it depends on what untrusted content actually reaches the model or a state-changing tool at call time. A static definition cannot show it.
LLM05 Improper Output Handling	out of scope (runtime)	Whether tool output is encoded or validated before it is acted on is observable only when the call runs, not from the definition.
LLM06 Excessive Agency	surface (SEC-009), capability lane	Unannotated destructive tools, plus the toxic-flow source-to-sink pairing check.
LLM07 System Prompt Leakage	surface (SEC-037)	A prompt or resource that embeds a system-instruction block carrying a secret. Tool-description secrets are SEC-008.
LLM08 Vector and Embedding Weaknesses	out of scope	Embedding and retrieval behavior is not visible in a tool/prompt/resource definition; there is no static signal to match.
LLM09 Misinformation	out of scope (runtime)	Factuality and context-faithfulness are graded on actual answers, handled by the eval lane, not by a static surface check.
LLM10 Unbounded Consumption	surface (SEC-036), probes lane	A list-like tool that declares no bound parameter is flagged statically; unbounded-response behavior at run time is handled by the active probes.

The runtime categories (LLM04, LLM05, LLM09) and the embedding category (LLM08) are deliberately left without a surface check rather than approximated with a heuristic that would mostly produce false positives. Adding a noisy check that fires on ordinary servers would erode trust in the deterministic verdict, which the security engine keeps free of guesswork.
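
One way to act on the scope table is to separate deliberate exclusions from accidental gaps. The sketch below transcribes the table into plain Python data and lists any category that has no surface rule, no other lane, and no documented out-of-scope reason. The variable and function names are hypothetical and this is not part of mcptest.

```python
ALL_CATEGORIES = [f"LLM{n:02d}" for n in range(1, 11)]  # LLM01 .. LLM10

# Transcribed from the scope table above.
SURFACE_RULES = {
    "LLM01": ["SEC-001"],
    "LLM02": ["SEC-003", "SEC-008"],
    "LLM06": ["SEC-009"],
    "LLM07": ["SEC-037"],
    "LLM10": ["SEC-036"],
}
OTHER_LANES = {"LLM03": "integrity lane"}
OUT_OF_SCOPE = {"LLM04", "LLM05", "LLM08", "LLM09"}


def unexplained_gaps(surface_rules):
    """Categories with no surface rule, no other lane, and no documented exclusion."""
    return [
        cat
        for cat in ALL_CATEGORIES
        if not surface_rules.get(cat)
        and cat not in OTHER_LANES
        and cat not in OUT_OF_SCOPE
    ]


print(unexplained_gaps(SURFACE_RULES))

# Simulate a catalog that has lost its LLM07 rule.
without_llm07 = {k: v for k, v in SURFACE_RULES.items() if k != "LLM07"}
print(unexplained_gaps(without_llm07))
```

With the table as written, every category is accounted for; dropping a covering rule makes its category show up as a gap that needs a new probe.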