
Scenario 5: three-stage CI quality gate

A single "run every test on every push" CI job works until it doesn't. As the suite grows, developers get used to broken pre-push runs and stop reading the output; the build queue gets backed up behind the slow tests; the team loses the feedback loop that made the suite worth writing.

The three-stage pattern uses the per-test tags: field, plus the --tag and --skip-tag flags, to run the smallest meaningful set of tests at each stage of the dev loop. The fastest tests run earliest. The expensive tests run least often.

The three stages

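| Stage            | Tag      | Triggered by                           | Wall budget         |
|------------------|----------|----------------------------------------|---------------------|
| Pre-commit smoke | quick    | local pre-commit hook, on every commit | under 5 seconds     |
| PR standard      | standard | every push to a PR branch              | under 30 seconds    |
| Build full       | full     | merges to main, nightly                | as long as it needs |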

quick catches typos and structural breakage. standard catches the regressions a reasonable developer might miss but a reviewer would spot. full catches the slow integration paths, the eval tests, and the edge cases that are too expensive to run on every push.

The YAML

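Give each test a tags: list naming the stage it belongs to. mcptest run --tag <name> keeps only the tests carrying that tag, and --skip-tag <name> drops the tests carrying that tag from a wider run. Because each test names only its own stage, the wider stages are built by exclusion: pre-commit runs --tag quick, PR runs use --skip-tag full (quick plus standard), and main runs everything with no filter.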

# yaml-language-server: $schema=https://mcptest.sh/schema/v1.json

servers:
  local:
    command: ["./target/debug/my-mcp-server"]

tools:
  - name: "smoke: initialize and list_tools"
    server: local
    tool: "list_tools"
    tags: ["quick"]
    expect:
      - target: "result.tools"
        matcher: { schema: { type: array, minItems: 1 } }

  - name: "standard: lookup happy path"
    server: local
    tool: "lookup_record"
    tags: ["standard"]
    args: { id: "rec_abc123" }
    expect:
      - target: "result.content[0].text"
        matcher: { contains: "rec_abc123" }

  - name: "standard: lookup error path"
    server: local
    tool: "lookup_record"
    tags: ["standard"]
    args: { id: "rec_missing" }
    expect:
      - target: "result.isError"
        matcher: { exact: true }

  - name: "full: large dataset under load"
    server: local
    tool: "stress_test"
    tags: ["full"]
    args: { records: 10000 }
    expect:
      assertions:
        - target: "result.isError"
          matcher: { exact: false }
      max_duration_ms: 30000
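The selection rules described above can be modelled in a few lines of bash. This is a toy model, not mcptest's implementation: it assumes --tag keeps only matching tests, --skip-tag drops them, and no flag runs everything, and it treats the two lookup tests as tagged standard. The function names select_tests and has_tag are hypothetical. Run against the four tests in the YAML, it selects 1, 3, and 4 tests for the three stages, the same counts as the expected output further down.

```bash
#!/usr/bin/env bash
# Toy model of stage selection by tag. NOT mcptest's implementation.
# Assumed semantics: --tag T keeps only tests tagged T, --skip-tag T drops
# tests tagged T, and no flag selects every test.
set -euo pipefail

# Each entry is "test name|comma-separated tags", mirroring the YAML above
# (assumes both lookup tests carry tags: ["standard"]).
TESTS=(
  "smoke: initialize and list_tools|quick"
  "standard: lookup happy path|standard"
  "standard: lookup error path|standard"
  "full: large dataset under load|full"
)

# has_tag TAGS TAG: succeeds if TAG is one of the comma-separated TAGS.
has_tag() {
  [[ ",$1," == *",$2,"* ]]
}

# select_tests [--tag T | --skip-tag T]: print the names that would run.
select_tests() {
  local mode=${1:-} tag=${2:-} entry name tags
  for entry in "${TESTS[@]}"; do
    name=${entry%%|*}   # text before the first "|"
    tags=${entry#*|}    # text after the first "|"
    case $mode in
      --tag)      if has_tag "$tags" "$tag"; then echo "$name"; fi ;;
      --skip-tag) if ! has_tag "$tags" "$tag"; then echo "$name"; fi ;;
      *)          echo "$name" ;;
    esac
  done
}

echo "pre-commit (--tag quick):"
select_tests --tag quick
echo "PR (--skip-tag full):"
select_tests --skip-tag full
echo "main (no filter):"
select_tests
```

The if statements inside case, rather than && or ||, keep the function's exit status at 0 when the last test is filtered out, which matters under set -e.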

The CI workflow

# .github/workflows/mcptest.yml
name: mcptest
on:
  push:
    branches: [main]
  pull_request:

jobs:
  pr-standard:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/cache@v4
        with:
          path: .mcptest-cache/
          key: mcptest-${{ runner.os }}-${{ hashFiles('mcptest.yml', 'tests/**') }}
      - run: cargo install mcptest --locked
      # Build the server binary that mcptest.yml launches (./target/debug/my-mcp-server).
      - run: cargo build
      # Standard stage: everything except tests tagged full.
      - run: mcptest run --skip-tag full --cache-dir .mcptest-cache/

  build-full:
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - uses: actions/checkout@v4
      - uses: actions/cache@v4
        with:
          path: .mcptest-cache/
          key: mcptest-${{ runner.os }}-${{ hashFiles('mcptest.yml', 'tests/**') }}
      - run: cargo install mcptest --locked
      - run: cargo build
      # Full stage: no tag filter, so every test runs.
      - run: mcptest run --cache-dir .mcptest-cache/

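Two jobs, each gated on its triggering event: pushes to a PR run the standard stage (everything except full), and merges to main run the full suite. Both jobs use the same cache path and key, but GitHub Actions scopes caches by branch: a PR run can restore a cache saved on main, while a run on main cannot restore caches saved from PR branches. In practice it is the full run on main that primes the cache for later PR runs, not the other way round.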
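The table lists nightly runs as a trigger for the full stage, but the workflow above only fires on pushes and pull requests. One way to add the nightly run is a schedule: trigger plus a wider if: on build-full. This is a sketch: only the changed keys are shown, and the cron time is an arbitrary example.

```yaml
# Sketch of the changes to .github/workflows/mcptest.yml for a nightly full run.
on:
  push:
    branches: [main]
  pull_request:
  schedule:
    - cron: "0 3 * * *"   # example time: 03:00 UTC every day

jobs:
  build-full:
    # Scheduled runs execute on the default branch, so accept them alongside pushes to main.
    if: github.event_name == 'schedule' || (github.event_name == 'push' && github.ref == 'refs/heads/main')
    # ...remaining build-full keys unchanged
```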

The pre-commit hook

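#!/usr/bin/env bash
# .git/hooks/pre-commit (the shebang must be the first line; make the file executable with chmod +x)
set -euo pipefail

echo "running mcptest quick smoke..."
# Rebuild so the hook exercises the code being committed, not a stale binary.
cargo build --quiet
mcptest run --tag quick --no-cache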

--no-cache on pre-commit is deliberate. The quick set is small enough that skipping the cache costs almost nothing, and bypassing it means the developer sees a real result for the code they just changed instead of a stale cached pass. Once they push, the PR job benefits from the cache.

If your team uses a managed pre-commit framework, configure it to run the same command.
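For example, assuming your team uses the pre-commit framework (pre-commit.com), a local hook can wrap the same command. The hook id and name below are arbitrary choices, not anything mcptest requires.

```yaml
# .pre-commit-config.yaml (assumes the pre-commit framework)
repos:
  - repo: local
    hooks:
      - id: mcptest-quick              # arbitrary hook id
        name: mcptest quick smoke
        entry: mcptest run --tag quick --no-cache
        language: system               # use the mcptest already on PATH
        pass_filenames: false          # don't append staged file names as arguments
        always_run: true               # run on every commit, whatever files changed
```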

Expected output

Pre-commit (quick):

$ git commit -m "..."
running mcptest quick smoke...

  PASS  smoke: initialize and list_tools       (87ms)

1 passed, 0 failed in 92ms

PR (standard):

mcptest run --skip-tag full

  PASS  standard: lookup happy path            (cached)
  PASS  standard: lookup error path            (cached)
  PASS  smoke: initialize and list_tools       (87ms)

3 passed, 0 failed in 91ms (2 cache hits)

main merge (full):

mcptest run

  PASS  smoke: initialize and list_tools       (cached)
  PASS  standard: lookup happy path            (cached)
  PASS  standard: lookup error path            (cached)
  PASS  full: large dataset under load         (28.4s)

4 passed, 0 failed in 28.5s (3 cache hits)

Tuning the stages

The three tags are conventions, not law. If your suite has a distinct "eval" cluster that costs money to run, give it a fourth tag, keep it out of the PR and main runs (the main run has no tag filter, so it would otherwise pick those tests up), and run it nightly from a schedule: trigger. If your team treats every push as a release candidate, collapse standard and full into one job.

The principle: run the cheapest tests most often, the expensive tests least often, and never let the expensive tests block a developer's fast feedback loop.

See also