Governance Benchmarks

GOVERN Sandbox includes a benchmarking suite that measures how well your governance configuration performs against standardized tests. Run the benchmarks before each major release to detect regressions.

Running Benchmarks

govern benchmark run \
  --sandbox "$SANDBOX_ID" \
  --suite comprehensive \
  --output benchmark-results.json

Benchmark Suites

Standard Suite

Covers common governance scenarios. Runs in ~15 minutes.

govern benchmark run --suite standard

Category	Tests	What Is Measured
Policy enforcement	50	Policy fires on violations, not on compliant inputs
Prompt injection resistance	100	Attack blocked rate
Bias detection	75	Disparity flagged accurately
False positive rate	50	Compliant inputs not blocked
Latency	100	p50/p95/p99 governance overhead

Comprehensive Suite

Full benchmark including adversarial and edge cases. Runs in ~60 minutes.

govern benchmark run --suite comprehensive

Adds the following categories to the Standard Suite:

Category	Tests	What Is Measured
Advanced injection	200	Novel and obfuscated attacks
Multi-turn attacks	50	Context manipulation across turns
Boundary conditions	75	Edge cases at policy thresholds
Concurrent load	25	Governance accuracy under load
Drift simulation	30	Score change detection speed

Federal Compliance Suite

Runs all governance checks required by NIST, CMMC, and FedRAMP.

govern benchmark run --suite federal --framework fedramp-high

Benchmark Metrics

Metric	Description	Target
Policy precision	True positives / (true positives + false positives)	> 95%
Policy recall	True positives / (true positives + false negatives)	> 99%
Injection block rate	Attacks blocked / Total attacks	> 97%
False positive rate	Compliant inputs blocked / Total compliant	< 2%
p50 governance latency	Median added latency	< 30ms
p99 governance latency	99th percentile added latency	< 150ms
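
As a rough illustration of how the ratios and percentiles in the table are defined, here is a minimal Python sketch. The `Counts` class, its field names, and the sample numbers are hypothetical and are not part of GOVERN or its output format; only the formulas come from the table above.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Counts:
    """Hypothetical tally of benchmark outcomes (names are illustrative)."""
    true_pos: int   # violations correctly blocked
    false_pos: int  # compliant inputs wrongly blocked
    false_neg: int  # violations that were missed
    true_neg: int   # compliant inputs correctly allowed


def precision(c: Counts) -> float:
    # True positives / (true positives + false positives)
    return c.true_pos / (c.true_pos + c.false_pos)


def recall(c: Counts) -> float:
    # True positives / (true positives + false negatives)
    return c.true_pos / (c.true_pos + c.false_neg)


def false_positive_rate(c: Counts) -> float:
    # Compliant inputs blocked / total compliant inputs
    return c.false_pos / (c.false_pos + c.true_neg)


def percentile(samples_ms: list[float], p: int) -> float:
    # quantiles(n=100) returns the 99 cut points p1..p99
    return quantiles(samples_ms, n=100, method="inclusive")[p - 1]


# Made-up sample data, for illustration only
counts = Counts(true_pos=198, false_pos=4, false_neg=2, true_neg=196)
latencies_ms = [float(v) for v in range(1, 102)]

print(f"precision           = {precision(counts):.4f}")
print(f"recall              = {recall(counts):.4f}")
print(f"false positive rate = {false_positive_rate(counts):.4f}")
print(f"p50 latency (ms)    = {percentile(latencies_ms, 50):.1f}")
print(f"p99 latency (ms)    = {percentile(latencies_ms, 99):.1f}")
```

Precision and recall share a numerator but answer different questions: precision drops when compliant inputs are blocked, while recall drops when violations slip through.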

Comparing Benchmark Results

Compare two benchmark runs to detect regressions:

govern benchmark compare \
  --baseline benchmark-v2.1.0.json \
  --current benchmark-v2.2.0.json \
  --output regression-report.json

A regression is flagged when any metric degrades by more than 2% compared with the baseline.
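
The 2% rule can be sketched in Python as follows. This is only an illustration under stated assumptions: it treats the 2% as a relative change (the text does not say whether it is relative or in percentage points), and the metric keys and the `HIGHER_IS_BETTER` map are hypothetical names, not the actual schema of the benchmark JSON files.

```python
# Direction of "better" for each metric. Keys are hypothetical names.
HIGHER_IS_BETTER = {
    "policy_precision": True,
    "policy_recall": True,
    "injection_block_rate": True,
    "false_positive_rate": False,
    "p50_latency_ms": False,
    "p99_latency_ms": False,
}


def degradation(metric: str, baseline: float, current: float) -> float:
    """Return the relative degradation as a fraction; 0.0 if not worse."""
    change = current - baseline
    # For higher-is-better metrics a decrease is worse; otherwise an increase is.
    worse = -change if HIGHER_IS_BETTER[metric] else change
    if worse <= 0:
        return 0.0
    if baseline == 0:
        return float("inf")  # any worsening from a zero baseline is unbounded
    return worse / abs(baseline)


def find_regressions(baseline: dict, current: dict, threshold: float = 0.02) -> dict:
    """Map each regressed metric to its relative degradation."""
    regressions = {}
    for metric, base_value in baseline.items():
        if metric not in current or metric not in HIGHER_IS_BETTER:
            continue  # skip metrics that cannot be compared
        d = degradation(metric, base_value, current[metric])
        if d > threshold:
            regressions[metric] = d
    return regressions


# Made-up values, for illustration only
baseline = {"policy_precision": 0.97, "false_positive_rate": 0.010, "p99_latency_ms": 120.0}
current = {"policy_precision": 0.96, "false_positive_rate": 0.015, "p99_latency_ms": 121.0}

regressions = find_regressions(baseline, current)
for name, d in regressions.items():
    print(f"REGRESSION {name}: {d:.1%} worse than baseline")
```

Under the relative interpretation, small absolute changes in a metric that is already close to zero, such as the false positive rate, can exceed the threshold easily, so it is worth confirming which interpretation the tool uses before tuning baselines.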

Benchmarks in CI/CD

Run benchmarks as part of your release pipeline:

- name: Create benchmark sandbox
  run: |
    SANDBOX=$(govern sandbox create --type ephemeral --duration 2h --json | jq -r '.id')
    echo "SANDBOX_ID=$SANDBOX" >> "$GITHUB_ENV"

- name: Run benchmarks
  run: |
    govern benchmark run \
      --sandbox "$SANDBOX_ID" \
      --suite standard \
      --output results.json

- name: Check benchmark pass
  run: |
    govern benchmark check results.json \
      --min-precision 0.95 \
      --min-recall 0.99 \
      --max-false-positive-rate 0.02 \
      --max-p99-latency-ms 150
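
Conceptually, the check step is a threshold gate over the reported metrics. The Python sketch below shows that logic with the same limits as the flags above. It is not the implementation of `govern benchmark check`: the metric keys are hypothetical, and treating the limits as inclusive (a value equal to the limit passes) is an assumption.

```python
# Limits mirror the CLI flags above. Keys are hypothetical metric names.
THRESHOLDS = {
    "precision": ("min", 0.95),
    "recall": ("min", 0.99),
    "false_positive_rate": ("max", 0.02),
    "p99_latency_ms": ("max", 150.0),
}


def check(metrics: dict) -> list[str]:
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing from results")
        elif kind == "min" and value < limit:
            failures.append(f"{name}: {value} is below the minimum {limit}")
        elif kind == "max" and value > limit:
            failures.append(f"{name}: {value} is above the maximum {limit}")
    return failures


# Made-up results, for illustration only
passing = {"precision": 0.97, "recall": 0.995, "false_positive_rate": 0.01, "p99_latency_ms": 140.0}
failing = dict(passing, recall=0.98)

for label, metrics in (("passing", passing), ("failing", failing)):
    problems = check(metrics)
    status = "PASS" if not problems else "FAIL"
    print(f"{label}: {status}")
    for problem in problems:
        print(f"  {problem}")
```

In a pipeline, a non-empty failure list would normally translate into a non-zero exit code so that the release job stops.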