Governance Benchmarks
GOVERN Sandbox includes a governance benchmarking suite that measures how well your governance configuration performs against standard test suites. Run benchmarks before major releases to detect regressions.
Running Benchmarks

    govern benchmark run \
      --sandbox $SANDBOX_ID \
      --suite comprehensive \
      --output benchmark-results.json

Benchmark Suites
Standard Suite
Covers common governance scenarios. Runs in ~15 minutes.

    govern benchmark run --suite standard

| Category | Tests | What Is Measured |
|---|---|---|
| Policy enforcement | 50 | Policy fires on violations, not on compliant inputs |
| Prompt injection resistance | 100 | Attack blocked rate |
| Bias detection | 75 | Disparity flagged accurately |
| False positive rate | 50 | Compliant inputs not blocked |
| Latency | 100 | p50/p95/p99 governance overhead |
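
As an illustration of the latency row, the following sketch shows one common way to derive p50/p95/p99 from a list of per-request overhead samples. The source does not say which percentile method the suite uses; the nearest-rank method, the function names, and the sample data here are all assumptions made for the example.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it. (Assumed method, not confirmed by the suite.)"""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(overhead_ms):
    """Summarize governance overhead samples (milliseconds)."""
    return {f"p{p}": percentile(overhead_ms, p) for p in (50, 95, 99)}


# Hypothetical overhead samples: 1 ms, 2 ms, ..., 100 ms
samples = list(range(1, 101))
summary = latency_summary(samples)
print(summary)  # {'p50': 50, 'p95': 95, 'p99': 99}
```

With only 100 samples, p99 is determined by the second-slowest request, so a single outlier can move it noticeably between runs.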
Comprehensive Suite
Full benchmark including adversarial and edge cases. Runs in ~60 minutes.

    govern benchmark run --suite comprehensive

Adds the following to the standard suite:
| Category | Tests | What Is Measured |
|---|---|---|
| Advanced injection | 200 | Novel and obfuscated attacks |
| Multi-turn attacks | 50 | Context manipulation across turns |
| Boundary conditions | 75 | Edge cases at policy thresholds |
| Concurrent load | 25 | Governance accuracy under load |
| Drift simulation | 30 | Score change detection speed |
Federal Compliance Suite
Runs all governance checks required by NIST, CMMC, and FedRAMP.

    govern benchmark run --suite federal --framework fedramp-high

Benchmark Metrics
| Metric | Description | Target |
|---|---|---|
| Policy precision | True positives / (True positives + False positives) | > 95% |
| Policy recall | True positives / (True positives + False negatives) | > 99% |
| Injection block rate | Attacks blocked / Total attacks | > 97% |
| False positive rate | Compliant inputs blocked / Total compliant | < 2% |
| p50 governance latency | Median added latency | < 30ms |
| p99 governance latency | 99th percentile added latency | < 150ms |
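
To make the ratio definitions in the table concrete, here is a minimal worked example that computes them from raw counts. The `PolicyCounts` type, the helper names, and the counts themselves are hypothetical. The sketch also assumes that a policy "firing" on a compliant input is what the table calls a blocked compliant input.

```python
from dataclasses import dataclass


@dataclass
class PolicyCounts:
    true_pos: int   # violations the policy fired on
    false_pos: int  # compliant inputs the policy fired on (blocked)
    false_neg: int  # violations the policy missed
    true_neg: int   # compliant inputs the policy let through


def ratio(numerator, denominator):
    """Safe division; returns 0.0 when there is nothing to measure."""
    return numerator / denominator if denominator else 0.0


def policy_precision(c):
    return ratio(c.true_pos, c.true_pos + c.false_pos)


def policy_recall(c):
    return ratio(c.true_pos, c.true_pos + c.false_neg)


def false_positive_rate(c):
    # Compliant inputs blocked / total compliant inputs
    return ratio(c.false_pos, c.false_pos + c.true_neg)


# Hypothetical counts, chosen only to illustrate the arithmetic
counts = PolicyCounts(true_pos=990, false_pos=10, false_neg=5, true_neg=990)
precision = policy_precision(counts)          # 990 / 1000
recall = policy_recall(counts)                # 990 / 995
fpr = false_positive_rate(counts)             # 10 / 1000
block_rate = ratio(294, 300)                  # attacks blocked / total attacks

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"fpr={fpr:.3f} block_rate={block_rate:.3f}")
```

Note that precision and false positive rate share a numerator term (false positives) but use different denominators, so the two can move independently when the mix of compliant and violating inputs changes.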
Comparing Benchmark Results
Compare two benchmark runs to detect regressions:

    govern benchmark compare \
      --baseline benchmark-v2.1.0.json \
      --current benchmark-v2.2.0.json \
      --output regression-report.json

A regression is flagged when any metric degrades more than 2% vs. baseline.
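
The 2% rule can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the source does not say whether "2%" is relative or absolute, so the sketch treats it as a relative change against the baseline. The metric names and the plain-dictionary layout are hypothetical, since the format of the results files is not shown here.

```python
# Assumed metric names; "degrades" means lower for some metrics, higher for others.
HIGHER_IS_BETTER = {"policy_precision", "policy_recall", "injection_block_rate"}
LOWER_IS_BETTER = {"false_positive_rate", "p50_latency_ms", "p99_latency_ms"}


def degradation(metric, baseline, current):
    """Relative degradation versus baseline; 0.0 when unchanged or improved."""
    if metric in HIGHER_IS_BETTER:
        worse_by = baseline - current
    elif metric in LOWER_IS_BETTER:
        worse_by = current - baseline
    else:
        raise ValueError(f"unknown metric: {metric}")
    if worse_by <= 0:
        return 0.0
    return worse_by / baseline if baseline else float("inf")


def find_regressions(baseline, current, threshold=0.02):
    """Return {metric: degradation} for metrics that got worse by more than threshold."""
    flagged = {}
    for metric, base_value in baseline.items():
        if metric not in current:
            continue  # metric missing from the newer run; not compared here
        d = degradation(metric, base_value, current[metric])
        if d > threshold:
            flagged[metric] = d
    return flagged


# Hypothetical values for two runs
baseline = {"policy_precision": 0.98, "policy_recall": 0.995,
            "false_positive_rate": 0.010, "p99_latency_ms": 120}
current = {"policy_precision": 0.95, "policy_recall": 0.996,
           "false_positive_rate": 0.011, "p99_latency_ms": 122}

regressions = find_regressions(baseline, current)
for name, d in regressions.items():
    print(f"{name}: degraded by {d:.1%}")
```

Under the relative interpretation, small-valued metrics such as false positive rate are flagged by tiny absolute changes (0.010 to 0.011 is a 10% degradation), which is one reason to confirm which interpretation the tool applies.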
Benchmark in CI/CD
Run benchmarks as part of your release pipeline:

    - name: Create benchmark sandbox
      run: |
        SANDBOX=$(govern sandbox create --type ephemeral --duration 2h --json | jq -r .id)
        echo "SANDBOX_ID=$SANDBOX" >> $GITHUB_ENV
    - name: Run benchmarks
      run: |
        govern benchmark run \
          --sandbox $SANDBOX_ID \
          --suite standard \
          --output results.json
    - name: Check benchmark pass
      run: |
        govern benchmark check results.json \
          --min-precision 0.95 \
          --min-recall 0.99 \
          --max-false-positive-rate 0.02 \
          --max-p99-latency-ms 150
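
The gating logic behind the final check step can be pictured with a short sketch. It mirrors the four thresholds passed to `govern benchmark check` above, but the metric key names, the results dictionary, and the inclusive treatment of boundary values are assumptions; the source does not describe how the command handles a value exactly at the limit.

```python
# Thresholds taken from the check step; key names are hypothetical.
THRESHOLDS = {
    "precision": ("min", 0.95),
    "recall": ("min", 0.99),
    "false_positive_rate": ("max", 0.02),
    "p99_latency_ms": ("max", 150),
}


def failed_checks(results, thresholds=THRESHOLDS):
    """Return a human-readable message for every threshold that is not met."""
    failures = []
    for metric, (kind, limit) in thresholds.items():
        value = results.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from results")
        elif kind == "min" and value < limit:
            failures.append(f"{metric}: {value} is below minimum {limit}")
        elif kind == "max" and value > limit:
            failures.append(f"{metric}: {value} is above maximum {limit}")
    return failures


# Hypothetical results from a benchmark run
results = {"precision": 0.97, "recall": 0.985,
           "false_positive_rate": 0.01, "p99_latency_ms": 140}

failures = failed_checks(results)
for message in failures:
    print(message)

# A CI step would exit non-zero here so the pipeline stops on failure.
exit_code = 1 if failures else 0
```

Treating a missing metric as a failure, rather than silently skipping it, keeps a truncated or partial results file from passing the gate.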