Observability and eval-platform exports
mcptest is the MCP-specific source of truth for a run: which tools were called, which checks passed, what the judges decided, what it cost. The eval and observability market (Braintrust, LangSmith, Arize, Galileo, Patronus, plus any OpenTelemetry collector) speaks in traces and spans. The openinference export projects a run into that world so you keep one MCP-aware runner and still see results in the dashboards your team already uses, with no vendor SDK bundled into mcptest.
The export
openinference is a reporter, so it works from both surfaces:
# Straight from a run.
mcptest run tests.yaml --reporter openinference --output run-trace.jsonl
# Or re-render a saved JSON run later (no re-execution).
mcptest run tests.yaml --reporter json --output run.json
mcptest report run.json --format openinference > run-trace.jsonl
The output is JSONL: one span per line, each self-describing with a schema_version of mcptest.dev/openinference/v1. The shape reuses the run report and the pinned OpenTelemetry span-name conventions (mcptest.run, mcptest.test) rather than inventing a parallel trace model.
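Because each line is a complete JSON object, a consumer can stream the file without loading it whole. The following is a minimal sketch of such a reader; the helper name `iter_spans` and the inline sample lines are illustrative and not part of mcptest.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_spans(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one span dict per non-empty JSONL line."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines such as a trailing newline
        span = json.loads(line)
        # Every span is expected to be self-describing.
        if "schema_version" not in span:
            raise ValueError(f"line {number}: missing schema_version")
        yield span


# Illustrative input; a real consumer would pass an open file handle instead.
sample_lines = [
    '{"schema_version":"mcptest.dev/openinference/v1","name":"mcptest.run","span_kind":"CHAIN"}',
    '{"schema_version":"mcptest.dev/openinference/v1","name":"mcptest.test","span_kind":"CHAIN"}',
    "",
]

span_counts = Counter(span["name"] for span in iter_spans(sample_lines))
print(dict(span_counts))
```

With a real export, the same generator can be fed the file object returned by `open("run-trace.jsonl")`.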
Span shape
A run becomes a small span tree:
- mcptest.run (root, span_kind: CHAIN): one per run. Attributes carry the run id, mcptest version, and the pass/fail/skip/inconclusive totals.
- mcptest.test (child of the run, one per test): name, status, duration, cache hit, file/line, score, and compliance rule_id where present.
- mcptest.verdict (child of a test, span_kind: EVALUATOR, only for judged tests): the consensus verdict, method, score, escalation flag, and a per-juror array (model, provider, verdict, score, cost).
{"schema_version":"mcptest.dev/openinference/v1","trace_id":"d58a4d69...","span_id":"61cc95b2...","parent_span_id":null,"name":"mcptest.run","span_kind":"CHAIN","status_code":"OK","duration_ms":12,"attributes":{"mcptest.run.id":"019...","mcptest.run.total":3,"mcptest.run.passed":3,"mcptest.run.failed":0}}
{"schema_version":"mcptest.dev/openinference/v1","trace_id":"d58a4d69...","span_id":"01cfdb87...","parent_span_id":"61cc95b2...","name":"mcptest.test","span_kind":"CHAIN","status_code":"OK","duration_ms":3,"attributes":{"mcptest.test.name":"search returns a hit","mcptest.test.status":"pass"}}
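The parent links in those lines are enough to rebuild the tree on the consumer side. This sketch assumes only the `span_id`, `parent_span_id`, `name`, and `status_code` fields shown above. The short ids are placeholders rather than real 16-hex values, and `build_tree` and `render` are hypothetical helper names.

```python
from collections import defaultdict


def build_tree(spans):
    """Return (roots, children), where children maps a span_id to its child spans."""
    children = defaultdict(list)
    roots = []
    for span in spans:
        parent = span.get("parent_span_id")
        if parent is None:
            roots.append(span)
        else:
            children[parent].append(span)
    return roots, children


def render(span, children, depth=0):
    """Return indented text lines for a span and all of its descendants."""
    lines = ["  " * depth + f'{span["name"]} [{span["status_code"]}]']
    for child in children.get(span["span_id"], []):
        lines.extend(render(child, children, depth + 1))
    return lines


# Placeholder ids for illustration only.
spans = [
    {"span_id": "run-1", "parent_span_id": None, "name": "mcptest.run", "status_code": "OK"},
    {"span_id": "test-1", "parent_span_id": "run-1", "name": "mcptest.test", "status_code": "OK"},
    {"span_id": "verdict-1", "parent_span_id": "test-1", "name": "mcptest.verdict", "status_code": "OK"},
]

roots, children = build_tree(spans)
tree_lines = [line for root in roots for line in render(root, children)]
print("\n".join(tree_lines))
```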
trace_id (32 hex) and span_id (16 hex) are derived deterministically from the run id, so re-exporting the same run produces the same ids and a consumer can correlate runs and link back to the source.
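A consumer can verify both properties without knowing how mcptest derives the ids. The sketch below checks the id widths and compares two exports of the same run. It assumes lowercase hex, as in the samples above. The ids are synthetic placeholders, and `ids_well_formed` and `same_ids` are hypothetical helper names.

```python
import re

# Widths come from the text above; lowercase hex is an assumption based on the samples.
TRACE_ID_RE = re.compile(r"[0-9a-f]{32}")
SPAN_ID_RE = re.compile(r"[0-9a-f]{16}")


def ids_well_formed(span):
    """True when trace_id is 32 hex characters and span_id is 16 hex characters."""
    return bool(TRACE_ID_RE.fullmatch(span["trace_id"])) and bool(
        SPAN_ID_RE.fullmatch(span["span_id"])
    )


def same_ids(export_a, export_b):
    """True when two exports contain exactly the same (trace_id, span_id) pairs."""
    def key(span):
        return (span["trace_id"], span["span_id"])

    return sorted(map(key, export_a)) == sorted(map(key, export_b))


# Synthetic ids standing in for two exports of one saved run.
first_export = [{"trace_id": "ab" * 16, "span_id": "cd" * 8}]
second_export = [{"trace_id": "ab" * 16, "span_id": "cd" * 8}]

all_valid = all(ids_well_formed(span) for span in first_export + second_export)
stable = same_ids(first_export, second_export)
print(all_valid, stable)
```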
status_code follows OpenTelemetry: OK for a pass, ERROR for a fail, UNSET for a skip or an inconclusive verdict. The run span is ERROR when any test failed.
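The mapping can be written out as a small lookup. In this sketch, only the status string "pass" appears in the sample above; "fail", "skip", and "inconclusive" are assumed spellings. Returning OK for a run with no failures is also an assumption, since the text only specifies the ERROR case.

```python
# Test status -> span status_code, per the rules above.
# Only "pass" is confirmed by the sample; the other keys are assumed spellings.
STATUS_CODE = {
    "pass": "OK",
    "fail": "ERROR",
    "skip": "UNSET",
    "inconclusive": "UNSET",
}


def status_code_for_test(status):
    """Map one test status to its span status_code."""
    return STATUS_CODE[status]


def run_status_code(test_statuses):
    """The run span is ERROR when any test failed; OK otherwise (assumed)."""
    return "ERROR" if "fail" in test_statuses else "OK"


print(status_code_for_test("inconclusive"))
print(run_status_code(["pass", "skip", "fail"]))
```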
Using it with eval and observability tools
The export is a neutral interchange format; each platform ingests it through its own collector or import path, with no mcptest-side coupling:
- OpenTelemetry collectors (Jaeger, Grafana Tempo, Honeycomb, ...): convert each line to an OTLP span (the field names align with the OTLP span model) and push through the collector's JSON/OTLP ingest.
- Arize / OpenInference-native tooling: the span_kind values (CHAIN, EVALUATOR) and the mcptest.* attributes follow OpenInference conventions, so spans land as chains with evaluator children.
- Braintrust / LangSmith / Galileo / Patronus: map one mcptest.test span to an eval row and its mcptest.verdict child to the score, using each platform's import API. The per-juror array gives you per-grader detail.
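For the OpenTelemetry collector route in the list above, the conversion might look like the following sketch. It targets the OTLP/JSON span encoding. The samples show only duration_ms, so the start time is supplied by the caller here, which is an assumption about what the export contains. The function names are hypothetical, and non-scalar attributes such as the per-juror array are simply stringified.

```python
# OTLP status codes: 0 = UNSET, 1 = OK, 2 = ERROR.
OTLP_STATUS = {"UNSET": 0, "OK": 1, "ERROR": 2}


def to_otlp_value(value):
    """Wrap a Python scalar in an OTLP AnyValue; other types are stringified."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return {"boolValue": value}
    if isinstance(value, int):
        return {"intValue": str(value)}  # OTLP/JSON encodes 64-bit ints as strings
    if isinstance(value, float):
        return {"doubleValue": value}
    return {"stringValue": str(value)}


def to_otlp_span(span, start_unix_nano):
    """Convert one export line (already parsed) to an OTLP/JSON span dict."""
    end_unix_nano = start_unix_nano + int(span["duration_ms"] * 1_000_000)
    attributes = [
        {"key": key, "value": to_otlp_value(value)}
        for key, value in span.get("attributes", {}).items()
    ]
    # Carry the OpenInference kind as an attribute; the OTLP kind stays INTERNAL.
    attributes.append(
        {"key": "openinference.span.kind", "value": {"stringValue": span["span_kind"]}}
    )
    otlp = {
        "traceId": span["trace_id"],
        "spanId": span["span_id"],
        "name": span["name"],
        "kind": 1,  # SPAN_KIND_INTERNAL
        "startTimeUnixNano": str(start_unix_nano),
        "endTimeUnixNano": str(end_unix_nano),
        "attributes": attributes,
        "status": {"code": OTLP_STATUS[span["status_code"]]},
    }
    if span.get("parent_span_id"):
        otlp["parentSpanId"] = span["parent_span_id"]
    return otlp


# Synthetic ids; real ones come from the export.
sample = {
    "trace_id": "a" * 32, "span_id": "b" * 16, "parent_span_id": "c" * 16,
    "name": "mcptest.test", "span_kind": "CHAIN", "status_code": "OK",
    "duration_ms": 3, "attributes": {"mcptest.test.status": "pass"},
}
otlp_span = to_otlp_span(sample, start_unix_nano=1_000)
print(otlp_span["status"], otlp_span["endTimeUnixNano"])
```

The resulting dicts would still need to be wrapped in the resourceSpans / scopeSpans envelope before being posted to a collector.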
Vendor-specific uploaders (pushing directly to a hosted endpoint) are deliberately out of the OSS core to avoid bundling vendor SDKs; the JSONL is the stable contract an adapter builds on.
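An adapter built on that contract can stay small. The sketch below joins each mcptest.test span with its mcptest.verdict child to produce generic eval rows. The attribute keys `mcptest.verdict.score` and `mcptest.verdict.jurors` and the row shape are assumptions made for illustration; check them against a real export and the target platform's import API.

```python
def to_eval_rows(spans):
    """Build one row per mcptest.test span, enriched by its verdict child if any."""
    # Index verdict spans by the test span they belong to.
    verdicts = {
        span["parent_span_id"]: span
        for span in spans
        if span["name"] == "mcptest.verdict"
    }
    rows = []
    for span in spans:
        if span["name"] != "mcptest.test":
            continue
        attrs = span.get("attributes", {})
        verdict_attrs = verdicts.get(span["span_id"], {}).get("attributes", {})
        rows.append({
            "name": attrs.get("mcptest.test.name"),
            "status": attrs.get("mcptest.test.status"),
            # Hypothetical attribute keys; unjudged tests get score None.
            "score": verdict_attrs.get("mcptest.verdict.score"),
            "jurors": verdict_attrs.get("mcptest.verdict.jurors", []),
        })
    return rows


# Placeholder ids and made-up values for illustration only.
spans = [
    {"span_id": "run-1", "parent_span_id": None, "name": "mcptest.run", "attributes": {}},
    {"span_id": "test-1", "parent_span_id": "run-1", "name": "mcptest.test",
     "attributes": {"mcptest.test.name": "search returns a hit", "mcptest.test.status": "pass"}},
    {"span_id": "verdict-1", "parent_span_id": "test-1", "name": "mcptest.verdict",
     "attributes": {"mcptest.verdict.score": 0.9}},
]
rows = to_eval_rows(spans)
print(rows)
```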
Provenance and evidence
Pair the trace with an evidence pack: the pack carries the run's grades, coverage, and a signed digest, while the trace carries the per-span detail. Both key off the same run id, so a reviewer can move from a dashboard span to the signed governance artifact for the same run.
Stability
The span names, span_kind values, and schema_version are stable; new attributes may be added under the mcptest.* namespace without a version bump. A breaking change to the shape bumps mcptest.dev/openinference/vN.
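Consumers can lean on that contract by gating on the major version and ignoring attributes they do not recognise. This is a minimal sketch; `schema_major`, `is_supported`, and the extra attribute key in the sample are hypothetical.

```python
import re

SCHEMA_RE = re.compile(r"mcptest\.dev/openinference/v(\d+)")
SUPPORTED_MAJOR = 1  # the version this consumer was written against


def schema_major(schema_version):
    """Extract N from mcptest.dev/openinference/vN, or raise on anything else."""
    match = SCHEMA_RE.fullmatch(schema_version)
    if match is None:
        raise ValueError(f"unrecognised schema_version: {schema_version!r}")
    return int(match.group(1))


def is_supported(span):
    """True when the span's schema major version matches this consumer."""
    return schema_major(span["schema_version"]) == SUPPORTED_MAJOR


span = {
    "schema_version": "mcptest.dev/openinference/v1",
    # An attribute this consumer has never seen (made-up key); it is simply ignored.
    "attributes": {"mcptest.test.status": "pass", "mcptest.test.some_new_field": 1},
}
supported = is_supported(span)
# Read known keys with .get so new attributes never break parsing.
status = span["attributes"].get("mcptest.test.status")
print(supported, status)
```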