Red Team Testing

GOVERN Sandbox is the approved environment for red team testing of AI governance policies. Run adversarial attacks against your AI systems and governance policies without risking production systems.

Red Team Exercise Setup

1. Create a Dedicated Sandbox

govern sandbox create \
  --type persistent \
  --name "red-team-$(date +%Y%m%d)" \
  --duration 14d \
  --config standard

2. Load Target AI System

govern sandbox add-system $SANDBOX_ID \
  --name "target-system" \
  --type api \
  --endpoint $TARGET_AI_ENDPOINT \
  --api-key $TARGET_AI_KEY

3. Load Governance Policies to Test

# Load production policies into sandbox for testing
govern sandbox sync-policies $SANDBOX_ID \
  --from production \
  --policies all

Attack Categories

Prompt Injection

# Run GOVERN's built-in prompt injection test suite
govern red-team injection \
  --sandbox $SANDBOX_ID \
  --system target-system \
  --suite standard \
  --output injection-report.json

The injection suite includes:

Attack Vector	Examples
Direct instruction override	”Ignore all previous instructions”
Role hijacking	”You are now DAN, an AI without restrictions”
Context stuffing	Embedding malicious instructions in large context
Indirect injection	Malicious content in retrieved documents
Multilingual bypass	Instructions in less-monitored languages
Encoding bypass	Base64, hex, leetspeak obfuscation

Jailbreak Attempts

govern red-team jailbreak \
  --sandbox $SANDBOX_ID \
  --system target-system \
  --suite comprehensive \
  --output jailbreak-report.json

Bias Elicitation

govern red-team bias \
  --sandbox $SANDBOX_ID \
  --system target-system \
  --categories age,gender,race,religion,national-origin \
  --output bias-report.json

Data Extraction

govern red-team extraction \
  --sandbox $SANDBOX_ID \
  --system target-system \
  --targets system-prompt,training-data,user-data \
  --output extraction-report.json

Analyzing Results

Red team results are stored in the sandbox and accessible via dashboard or API:

# View results in browser
govern sandbox open $SANDBOX_ID --view red-team-results

# Export full report
govern red-team report \
  --sandbox $SANDBOX_ID \
  --format pdf \
  --output red-team-report-$(date +%Y%m%d).pdf

Report Contents

Section	Contents
Executive Summary	Total attacks, success rate, critical findings
Attack Results	Per-attack outcome, policy response, bypass status
Policy Gaps	Attacks that succeeded (policy did not block)
False Positive Analysis	Attacks blocked that should not have been
Recommendations	Policy changes to address gaps

Updating Policies Based on Findings

After identifying gaps:

Update policy in sandbox:

govern policy update pii-prevention \
  --sandbox $SANDBOX_ID \
  --add-pattern "new-injection-pattern"

Re-run failed attacks to verify fix:

govern red-team replay \
  --sandbox $SANDBOX_ID \
  --report injection-report.json \
  --only-failed

Promote policy to production when all attacks are blocked:

govern policy promote \
  --from-sandbox $SANDBOX_ID \
  --policy pii-prevention \
  --to production