Governance Benchmarks

GOVERN Sandbox includes a governance benchmarking suite that measures how well your governance configuration performs against standard test suites. Run benchmarks before major releases to detect regressions.

Running Benchmarks

Terminal window
govern benchmark run \
  --sandbox "$SANDBOX_ID" \
  --suite comprehensive \
  --output benchmark-results.json

Benchmark Suites

Standard Suite

Covers common governance scenarios. Runs in ~15 minutes.

Terminal window
govern benchmark run --suite standard

| Category | Tests | What Is Measured |
| --- | --- | --- |
| Policy enforcement | 50 | Policy fires on violations, not on compliant inputs |
| Prompt injection resistance | 100 | Attack blocked rate |
| Bias detection | 75 | Disparity flagged accurately |
| False positive rate | 50 | Compliant inputs not blocked |
| Latency | 100 | p50/p95/p99 governance overhead |
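
The latency row reports percentiles of governance overhead. As a rough illustration of what p50/p95/p99 mean, the sketch below computes them from a list of per-request overhead samples in milliseconds. The function name `latency_percentiles` and the sample data are hypothetical. This is not how GOVERN computes its figures internally, and the exact interpolation method it uses is not documented here.

```python
import statistics


def latency_percentiles(overhead_ms: list[float]) -> dict[str, float]:
    """Return p50/p95/p99 of per-request governance overhead (hypothetical helper)."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(overhead_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Illustrative samples only: 100 requests with overheads of 1 ms to 100 ms.
samples = [float(ms) for ms in range(1, 101)]
result = latency_percentiles(samples)
print(result)
```

Tail percentiles such as p99 are sensitive to sample size, so a small run may give noisy values.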

Comprehensive Suite

Full benchmark including adversarial and edge cases. Runs in ~60 minutes.

Terminal window
govern benchmark run --suite comprehensive

Adds to standard:

| Category | Tests | What Is Measured |
| --- | --- | --- |
| Advanced injection | 200 | Novel and obfuscated attacks |
| Multi-turn attacks | 50 | Context manipulation across turns |
| Boundary conditions | 75 | Edge cases at policy thresholds |
| Concurrent load | 25 | Governance accuracy under load |
| Drift simulation | 30 | Score change detection speed |

Federal Compliance Suite

Runs all governance checks required by NIST, CMMC, and FedRAMP.

Terminal window
govern benchmark run --suite federal --framework fedramp-high

Benchmark Metrics

| Metric | Description | Target |
| --- | --- | --- |
| Policy precision | True positives / (True positives + False positives) | > 95% |
| Policy recall | True positives / (True positives + False negatives) | > 99% |
| Injection block rate | Attacks blocked / Total attacks | > 97% |
| False positive rate | Compliant inputs blocked / Total compliant | < 2% |
| p50 governance latency | Median added latency | < 30ms |
| p99 governance latency | 99th percentile added latency | < 150ms |
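
To make the ratio definitions above concrete, here is a minimal sketch that derives precision, recall, and false positive rate from raw outcome counts and compares them with the targets in the table. The `ConfusionCounts` structure, the function names, and the example counts are assumptions made for illustration. They do not reflect the format of the benchmark output file.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Outcome counts for a benchmark run (hypothetical structure)."""
    true_pos: int   # violations the policy correctly fired on
    false_pos: int  # compliant inputs the policy wrongly blocked
    false_neg: int  # violations the policy missed
    true_neg: int   # compliant inputs correctly allowed


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def policy_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute the ratio metrics exactly as defined in the table above."""
    return {
        "precision": ratio(c.true_pos, c.true_pos + c.false_pos),
        "recall": ratio(c.true_pos, c.true_pos + c.false_neg),
        "false_positive_rate": ratio(c.false_pos, c.false_pos + c.true_neg),
    }


def meets_targets(m: dict[str, float]) -> bool:
    """Check the documented targets: precision > 95%, recall > 99%, FPR < 2%."""
    return (
        m["precision"] > 0.95
        and m["recall"] > 0.99
        and m["false_positive_rate"] < 0.02
    )


# Made-up counts: 200 violating inputs and 200 compliant inputs.
counts = ConfusionCounts(true_pos=199, false_pos=3, false_neg=1, true_neg=197)
metrics = policy_metrics(counts)
print(metrics, meets_targets(metrics))
```

Precision and false positive rate both penalize false positives but use different denominators, so a configuration can pass one target and fail the other.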

Comparing Benchmark Results

Compare two benchmark runs to detect regressions:

Terminal window
govern benchmark compare \
  --baseline benchmark-v2.1.0.json \
  --current benchmark-v2.2.0.json \
  --output regression-report.json

A regression is flagged when any metric degrades by more than 2% compared with the baseline.
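
The 2% rule can be sketched as follows. This is an illustration under stated assumptions, not the logic of `govern benchmark compare`. It assumes that "2%" means a relative change against the baseline value, that results are available as flat name-to-value mappings, and that the metric names shown are hypothetical. It also has to know which direction counts as "worse" for each metric.

```python
# Hypothetical metric names; the real result files may use different keys.
HIGHER_IS_BETTER = {"policy_precision", "policy_recall", "injection_block_rate"}
LOWER_IS_BETTER = {"false_positive_rate", "p50_latency_ms", "p99_latency_ms"}


def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,  # assumption: 2% relative to the baseline value
) -> dict[str, float]:
    """Return metrics whose relative degradation exceeds the tolerance."""
    regressions = {}
    for name, base in baseline.items():
        if name not in current or base == 0:
            continue  # cannot compute a relative change
        cur = current[name]
        if name in HIGHER_IS_BETTER:
            degradation = (base - cur) / base  # a drop is a degradation
        elif name in LOWER_IS_BETTER:
            degradation = (cur - base) / base  # a rise is a degradation
        else:
            continue  # unknown metric: direction not defined
        if degradation > tolerance:
            regressions[name] = degradation
    return regressions


# Made-up values for two runs.
baseline = {"policy_precision": 0.97, "policy_recall": 0.995, "p99_latency_ms": 120.0}
current = {"policy_precision": 0.94, "policy_recall": 0.993, "p99_latency_ms": 121.0}
flagged = find_regressions(baseline, current)
print(sorted(flagged))
```

In this example only precision is flagged, because it dropped by roughly 3% of its baseline value while the other two metrics moved by less than 1%.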

Benchmark in CI/CD

Run benchmarks as part of your release pipeline:

.github/workflows/govern-benchmark.yml
- name: Create benchmark sandbox
  run: |
    SANDBOX=$(govern sandbox create --type ephemeral --duration 2h --json | jq -r .id)
    echo "SANDBOX_ID=$SANDBOX" >> "$GITHUB_ENV"
- name: Run benchmarks
  run: |
    govern benchmark run \
      --sandbox "$SANDBOX_ID" \
      --suite standard \
      --output results.json
- name: Check benchmark pass
  run: |
    govern benchmark check results.json \
      --min-precision 0.95 \
      --min-recall 0.99 \
      --max-false-positive-rate 0.02 \
      --max-p99-latency-ms 150