PolyGuard is a comprehensive multi-domain safety policy-grounded guardrail dataset designed to evaluate and benchmark content safety models across eight critical domains. Built on authentic, domain-specific safety policies, PolyGuard provides a robust framework for testing guardrail effectiveness in real-world scenarios.
- π Massive Scale: 100k+ data instances spanning 8 safety-critical domains
- ποΈ Policy-Grounded: Based on 150+ authentic safety policies from real organizations
- π― Comprehensive Coverage: 400+ risk categories and 1000+ safety rules
- π Diverse Formats: Declarative statements, questions, instructions, and multi-turn conversations
- π‘οΈ Attack-Enhanced: Sophisticated adversarial testing scenarios
- βοΈ Challenging Benign Data: Detoxified prompts to test over-refusal behaviors
- Platforms: Reddit, Twitter/X, Instagram, Discord, YouTube, Spotify
- Focus: Content moderation, community guidelines, platform-specific safety rules
- Features: Multi-platform policy extraction, attack scenario testing
- Regulators: FINRA, BIS, OECD, Treasury, Alan
- Focus: Financial compliance, regulatory requirements, investment advice safety
- Features: Policy extraction from PDFs, adversarial rephrasing
- Organizations: ABA, California Bar, DC Bar, Florida Bar, NCSC, Texas Bar, UK Judiciary
- Focus: Legal ethics, professional conduct, attorney-client privilege
- Features: Multi-stage workflow, iterative adversarial attacks
- Institutions: Google, Microsoft, Amazon, Apple, Meta, NVIDIA, IBM, Intel, Adobe, ByteDance
- Educational: College Board AP, CSU, AAMC, AI for Education, McGovern Med, NIU, UNESCO, IB
- Focus: Academic integrity, educational technology safety, student protection
- Companies: Google, Microsoft, Amazon, Apple, Meta, NVIDIA, IBM, Intel, Adobe, ByteDance
- Focus: Workplace safety, discrimination prevention, harassment policies
- Features: Policy-based data generation, comprehensive evaluation
- Categories: CVE, Malware, MITRE ATT&CK, Phishing, Code Interpreter
- Focus: Security threats, vulnerability assessment, malicious code detection
- Features: Multi-format evaluation (prompt/chat), comprehensive model testing
- Categories: GPT Bias, Insecure Code
- Focus: AI-generated code safety, bias detection, security vulnerabilities
- Features: Code-specific evaluation, bias assessment
- Frameworks: GDPR, EU AI Act
- Focus: Regulatory compliance, privacy protection, AI governance
- Features: Attack evaluation, conversation assessment, query analysis
# Clone the repository
git clone https://github.com/your-org/PolyGuard.git
cd PolyGuard
# Install dependencies
pip install -r requirement.txt
# Set up environment variables
export OPENAI_API_KEY="your-openai-api-key"
cd social_media
# Standard evaluation across all platforms
sh run.sh
# Attack-enhanced evaluation
sh attack.sh
# Individual model evaluation
python main.py --model meta-llama/Llama-Guard-4-12B --domain Reddit --device cuda:0
cd finance
# Extract policies from PDFs
python extract_generate.py --name finra --model_name o4-mini-2025-04-16
# Rephrase malicious requests
python rephrase.py --model-name o4-mini-2025-04-16
# Evaluate guardrail performance
python eval.py --name finra --evaluate-input
python eval.py --name finra
cd law
# Extract legal policies
python extract_generate.py --name aba --model_name o4-mini-2025-04-16
# Evaluate guardrail models
python eval.py --name aba --evaluate-input
python eval.py --name aba
# Run adversarial attacks
python attack.py --name aba
cd education
# Evaluate on education policies
python eval.py --model_id meta-llama/Llama-Guard-3-8B
cd hr
# Evaluate on HR policies
python eval.py --model_id meta-llama/Llama-Guard-3-8B
cd cyber
# Evaluate on cybersecurity datasets
python evaluate.py --input_file data/cve_full.json --prompt_or_chat prompt
python evaluate.py --input_file data/malware_full.json --prompt_or_chat chat
cd code
# Evaluate on code safety datasets
python evaluate.py --input_file data/GPT_bias_full.json --prompt_or_chat prompt
python evaluate.py --input_file data/insecure_code_full.json --prompt_or_chat chat
cd regulation
# Attack evaluation on regulatory compliance
python evaluate_attack.py --model_id OpenSafetyLab/MD-Judge-v0_2-internlm2_7b
# Conversation evaluation
python evaluate_conversation.py --model_id meta-llama/Llama-Guard-3-8B
# Query evaluation
python evaluate_query.py --model_id allenai/wildguard
PolyGuard evaluates 18+ state-of-the-art content safety models:
meta-llama/Llama-Guard-4-12B
- Latest LlamaGuard modelmeta-llama/Llama-Guard-3-8B
- LlamaGuard 3 8B parameter modelmeta-llama/Meta-Llama-Guard-2-8B
- LlamaGuard 2 modelmeta-llama/Llama-Guard-3-1B
- Lightweight LlamaGuard 3 modelmeta-llama/LlamaGuard-7b
- Original LlamaGuard model
google/shieldgemma-2b
- ShieldGemma 2B parameter modelgoogle/shieldgemma-9b
- ShieldGemma 9B parameter model
text-moderation-latest
- OpenAI text moderation APIomni-moderation-latest
- OpenAI omni-moderation API
OpenSafetyLab/MD-Judge-v0_2-internlm2_7b
- MD-Judge v0.2 modelOpenSafetyLab/MD-Judge-v0.1
- MD-Judge v0.1 modelallenai/wildguard
- WildGuard model
nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0
- Permissive Aegis modelnvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0
- Defensive Aegis model
ibm-granite/granite-guardian-3.2-3b-a800m
- Granite Guardian 3.2B modelibm-granite/granite-guardian-3.2-5b
- Granite Guardian 5B model
llmjudge
- LLMJudge evaluation modelazure
- Azure Content Safety APIaws
- AWS Bedrock safety models
PolyGuard provides comprehensive evaluation metrics:
- Precision - Ratio of correctly flagged unsafe content to total flagged content
- Recall - Ratio of correctly flagged unsafe content to total unsafe content
- F1-Score - Harmonic mean of precision and recall
- Accuracy - Overall classification accuracy
- Per-Category Performance - Metrics for specific safety rule violations
- Attack Success Rate - Effectiveness of adversarial bypass attempts
PolyGuard includes sophisticated attack evaluation to test guardrail robustness:
- Reasoning Distraction - Using logical puzzles to distract from unsafe content
- Category Shifting - Attempting to shift safety category definitions
- Instruction Forcing - Direct attempts to override safety instructions
- Iterative Refinement - Continuous prompt refinement using GPT-4
- Adversarial Prompts - Carefully crafted prompts to bypass safety filters
- Context Manipulation - Modifying context to hide unsafe content
- Policy Exploitation - Exploiting gaps in safety policy definitions
PolyGuard/
βββ social_media/ # Social media platform evaluation
β βββ datagen/ # Data generation and policy processing
β βββ guardrail_model/ # Guardrail model implementations
β βββ main.py # Standard evaluation
β βββ main_attack.py # Attack evaluation
β βββ results/ # Evaluation results
βββ finance/ # Financial domain evaluation
β βββ policy_pdf/ # Regulatory PDFs
β βββ extract_generate.py
β βββ eval.py
β βββ attack.py
βββ law/ # Legal domain evaluation
β βββ policy_pdf/ # Legal policy PDFs
β βββ extract_generate.py
β βββ eval.py
β βββ attack.py
βββ education/ # Education domain evaluation
β βββ eval.py
β βββ results/ # Institution-specific results
βββ hr/ # HR domain evaluation
β βββ eval.py
β βββ results/ # Company-specific results
βββ cyber/ # Cybersecurity domain evaluation
β βββ data/ # Security datasets
β βββ evaluate.py
βββ code/ # Code generation domain evaluation
β βββ data/ # Code safety datasets
β βββ evaluate.py
βββ regulation/ # Regulatory compliance evaluation
β βββ evaluate_attack.py
β βββ evaluate_conversation.py
β βββ evaluate_query.py
βββ requirement.txt # Python dependencies
export OPENAI_API_KEY="your-openai-api-key"
export HUGGINGFACE_TOKEN="your-huggingface-token" # For private models
- Device: Specify GPU device with
--device cuda:0
- Cache Directory: Set model cache with
--cache_dir /path/to/cache
- Batch Processing: Configure batch windows and polling intervals
Results are stored in domain-specific directories with the following structure:
{MODEL}_precision.json
- Precision metrics by category{MODEL}_recall.json
- Recall metrics by category{MODEL}_f1.json
- F1-score metrics by category{MODEL}_all_records.jsonl
- Complete evaluation records
Attack results include:
- Attack success rates
- Adversarial prompt effectiveness
- Guardrail bypass patterns
- Iterative refinement logs
We welcome contributions to PolyGuard! Please see our contributing guidelines for:
- Adding new domains
- Implementing new guardrail models
- Improving evaluation metrics
- Enhancing attack scenarios