Current AI Safety Solutions¶
A reference guide to production-ready guardrail, evaluation, and safety solutions implementing the three-layer pattern.
Solutions At a Glance¶
| Solution | Type | What It Does | Layer | Open Source | Key Limitation |
|---|---|---|---|---|---|
| AWS Bedrock Guardrails | Managed | Content filtering, PII detection, hallucination checks, denied topics | Guardrails | No | 30 denied topic limit; cross-region IAM issues |
| Azure AI Content Safety | Managed | Harm classification (0-7 severity), prompt shields, groundedness | Guardrails | No | English-optimized; 10K char limit per request |
| NVIDIA NeMo Guardrails | Framework | Programmable rails (input/output/dialog/retrieval/execution) | Guardrails | Yes | Dialog rails don't work with reasoning models |
| Guardrails AI | Framework | Output validation, structured output enforcement, retry logic | Guardrails | Yes | Output-focused; less input validation |
| Llama Guard 3/4 | Model | LLM-based content classification (safe/unsafe + category) | Guardrails/Judge | Yes | ~33% attack bypass rate; English-optimized |
| OpenAI Moderation API | API | Harm classification across categories | Guardrails | No | OpenAI models only; limited customization |
| DeepEval | Framework | LLM-as-judge evaluation, 50+ metrics, CI/CD integration | Judge | Yes | LLM calls add cost/latency at scale |
| Galileo | Platform | Eval-to-guardrail lifecycle, Luna models for monitoring | Judge | No | Platform dependency |
| Prompt Guard (Meta) | Model | Prompt injection and jailbreak detection | Guardrails | Yes | Needs fine-tuning for best results |
| LlamaFirewall (Meta) | Tool | Security guardrail for AI systems | Guardrails | Yes | Early stage |
Solutions by Use Case¶
| If You Need... | Primary Choice | Alternative |
|---|---|---|
| Turnkey AWS guardrails | AWS Bedrock Guardrails | — |
| Turnkey Azure guardrails | Azure AI Content Safety | — |
| Self-hosted, customizable | NVIDIA NeMo Guardrails | Guardrails AI |
| Open-source safety model | Llama Guard 3/4 | Prompt Guard |
| LLM evaluation/testing | DeepEval | Galileo |
| Production monitoring | Confident AI (DeepEval) | Galileo |
| Structured output validation | Guardrails AI | NeMo Guardrails |
| Multimodal content safety | Azure AI Content Safety | Llama Guard 4 |
| Hallucination detection | AWS Bedrock (Automated Reasoning) | DeepEval metrics |
Solutions by Layer¶
Guardrails Layer (Real-time, ~10-100ms)¶
| Solution | Input | Output | Multimodal | Customizable | Self-Hosted |
|---|---|---|---|---|---|
| AWS Bedrock Guardrails | ✓ | ✓ | Images (preview) | Limited | No |
| Azure AI Content Safety | ✓ | ✓ | ✓ | Custom categories | No |
| NVIDIA NeMo Guardrails | ✓ | ✓ | Limited | Highly | Yes |
| Guardrails AI | Limited | ✓ | No | Highly | Yes |
| Llama Guard | ✓ | ✓ | Llama Guard 4 | Via prompting | Yes |
| OpenAI Moderation | ✓ | ✓ | No | No | No |
Judge Layer (Async, ~500ms-5s)¶
| Solution | Metrics | Custom Criteria | Production Monitoring | CI/CD |
|---|---|---|---|---|
| DeepEval | 50+ | G-Eval, DAG | Via Confident AI | ✓ |
| Galileo | Multiple | ✓ | Built-in | ✓ |
| Custom LLM prompts | Unlimited | ✓ | DIY | DIY |
Industry Context¶
The AI security industry has converged on a common pattern: layered runtime controls combining fast filtering (guardrails), deeper evaluation (LLM-as-judge), and human oversight. This guide catalogs the major solutions implementing this pattern, with honest assessments of capabilities, limitations, and appropriate use cases.
This page exists to give credit where it's due and help practitioners select appropriate tools. The Framework synthesizes and explains the pattern these solutions implement.
Quick Reference: Solution Categories¶
| Category | Purpose | Examples |
|---|---|---|
| Platform Guardrails | Cloud-native filtering integrated with AI services | AWS Bedrock Guardrails, Azure AI Content Safety |
| Open-Source Frameworks | Self-hosted, customizable guardrail systems | NVIDIA NeMo Guardrails, Guardrails AI |
| Safety Models | LLM-based content moderation | Llama Guard, OpenAI Moderation API |
| Evaluation Frameworks | LLM-as-Judge implementation | DeepEval, Galileo |
| Standards & Guidance | Risk frameworks and taxonomies | OWASP LLM Top 10, NIST AI RMF |
Platform Guardrails¶
AWS Bedrock Guardrails¶
Overview: Managed guardrail service integrated with Amazon Bedrock foundation models. Provides content filtering, PII detection, denied topics, and (uniquely) automated reasoning checks for hallucination detection.
How It Works:

- Evaluates both user inputs and model responses against configured policies
- Six safeguard types: content filters, denied topics, word filters, sensitive information filters, contextual grounding checks, and Automated Reasoning checks
- Can be called via the ApplyGuardrail API without invoking a model
- Works with any model (Bedrock-hosted or external via API)
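The standalone flow can be sketched with boto3. This is a sketch, not a full integration: the guardrail ID and version are placeholders for a guardrail you have already created, and the helper simply checks the documented `action` field in the response.

```python
def guardrail_intervened(response: dict) -> bool:
    """Return True when an ApplyGuardrail response indicates a block."""
    return response.get("action") == "GUARDRAIL_INTERVENED"

def check_user_input(text: str, guardrail_id: str, guardrail_version: str) -> dict:
    """Screen raw user input via the ApplyGuardrail API, without invoking a model.

    Requires boto3 and AWS credentials; guardrail_id and guardrail_version
    are placeholders for an existing guardrail.
    """
    import boto3  # third-party; imported here so the helper above stays self-contained
    client = boto3.client("bedrock-runtime")
    return client.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source="INPUT",  # use "OUTPUT" to screen model responses instead
        content=[{"text": {"text": text}}],
    )
```

The same guardrail configuration can then be attached to Bedrock model invocations directly, keeping policy definitions in one place.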
Strengths:

- Automated Reasoning checks claim 99% accuracy for hallucination detection (AWS claim)
- Blocks up to 88% of harmful content (AWS benchmark)
- Native integration with Bedrock agents, knowledge bases, and flows
- Cross-model consistency: the same guardrails work across different foundation models

Limitations:

- Cross-region complexity: known IAM permission issues when guardrails and agents sit in different regions
- Input tagging limitations: not currently supported with managed prompts
- Latency and cost: adds processing time, and charges apply even when input is blocked
- 30 denied-topic limit: may be insufficient for complex policy sets

Known Issues:

- Access-denied errors when using cross-region guardrails with Bedrock Agents (requires careful IAM configuration)
- VPC endpoint limitations for cross-region access
- Streaming not fully supported with all guardrail configurations
Best For: Organizations already using AWS Bedrock who want turnkey guardrails with minimal setup.
Not Recommended For: Complex multi-region deployments without careful IAM planning; use cases requiring more than 30 denied topics.
Pricing: Per 1,000 text units (1,000 characters each). Word filters free. See AWS Pricing.
Documentation: AWS Bedrock Guardrails
Azure AI Content Safety¶
Overview: Microsoft's content moderation service providing text and image analysis with severity scoring across harm categories.
How It Works:

- Multi-class classification for hate, violence, sexual content, and self-harm
- Severity levels 0-7 for text, 0-3 for images
- Prompt Shields for jailbreak and injection detection
- Groundedness detection for hallucination
- Protected material detection for copyright
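As a sketch of how severity scoring feeds a blocking decision, the snippet below pairs a local threshold helper (using the 0-7 text scale) with a call through the `azure-ai-contentsafety` SDK; the endpoint, key, and threshold value are placeholders you would tune per category.

```python
def should_block(severities: dict, threshold: int = 4) -> bool:
    """Block when any harm category reaches the threshold on the 0-7 text scale."""
    return any(sev >= threshold for sev in severities.values())

def analyze(text: str, endpoint: str, key: str) -> dict:
    """Return {category: severity} for a text sample via Azure AI Content Safety.

    Requires the azure-ai-contentsafety package; endpoint and key are
    placeholders for your Content Safety resource.
    """
    from azure.ai.contentsafety import ContentSafetyClient
    from azure.ai.contentsafety.models import AnalyzeTextOptions
    from azure.core.credentials import AzureKeyCredential

    client = ContentSafetyClient(endpoint, AzureKeyCredential(key))
    result = client.analyze_text(AnalyzeTextOptions(text=text))
    return {item.category: item.severity for item in result.categories_analysis}
```

In practice each category usually gets its own threshold rather than one global cutoff, since tolerances for, say, violence and self-harm differ by application.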
Strengths:

- Multimodal support (text, images, text+image)
- Granular severity scoring (not just binary)
- Custom categories API for domain-specific content
- Integration with Azure OpenAI and Azure AI Foundry
- Protected material detection for copyright compliance

Limitations:

- Language support: optimized for English; performance varies for other languages (German, Japanese, Spanish, French, Italian, Portuguese, and Chinese are supported)
- 10K character limit per text submission
- Image recognition limits: may miss content in unclear or edited images
- Cannot detect CSAM: explicitly stated limitation
- Evolving threats: may not keep pace with new attack techniques

Known Issues:

- False positives reported in scientific and medical contexts (pharmaceutical companies report legitimate content being flagged)
- Groundedness detection inconsistencies (some users report empty results)
- Content filter token costs can be significant (reported at 10x other costs in some deployments)
Best For: Microsoft Azure customers needing content moderation with severity scoring and multimodal support.
Not Recommended For: Non-English content at scale; scientific/medical applications without custom configuration.
Pricing: Per text record (1,000 characters) and per image. See Azure Pricing.
Documentation: Azure AI Content Safety
Open-Source Frameworks¶
NVIDIA NeMo Guardrails¶
Overview: Open-source Python library for adding programmable guardrails to LLM applications. Highly customizable with support for multiple rail types and integration with major LLM providers.
How It Works:

- Five rail types: input, dialog, retrieval, execution, and output
- Colang 2.0 DSL for defining conversational flows
- Can orchestrate multiple rails with a configurable execution order
- Supports GPU acceleration for low-latency performance
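A minimal configuration and invocation might look like the following. This is a sketch under assumptions: the `self check input` rail also requires a matching `self_check_input` prompt in the config (omitted here for brevity), and running it needs the `nemoguardrails` package plus credentials for the configured LLM engine.

```python
# Inline YAML config declaring the main model and one input rail.
YAML_CONFIG = """\
models:
  - type: main
    engine: openai
    model: gpt-4o-mini
rails:
  input:
    flows:
      - self check input
"""

def moderated_reply(user_message: str) -> str:
    """Run one conversational turn through the configured rails.

    Requires the nemoguardrails package and an OpenAI API key.
    """
    from nemoguardrails import LLMRails, RailsConfig
    config = RailsConfig.from_content(yaml_content=YAML_CONFIG)
    rails = LLMRails(config)
    result = rails.generate(messages=[{"role": "user", "content": user_message}])
    return result["content"]
```

Dialog and execution rails are added the same way, as additional flows in the `rails` section, with Colang files supplying the flow definitions.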
Strengths:

- Highly programmable: the Colang DSL allows complex policy logic
- Multi-rail orchestration: coordinate input, dialog, retrieval, execution, and output rails
- LLM provider agnostic: works with OpenAI, Azure, Anthropic, HuggingFace, and NIM
- LangChain/LangGraph integration: native support for popular frameworks
- GPU acceleration: NVIDIA hardware optimization for performance

Limitations:

- Learning curve: the Colang DSL takes time to learn
- LLM dependency: most rails require an LLM for evaluation, adding latency and cost
- Dialog rails not supported with reasoning models: documented limitation
- Built-in rails may not suit production: NVIDIA explicitly states they "may or may not be suitable for a given production use case"

Known Issues:

- Jailbreak detection container setup issues reported (GitHub Issue #690)
- Reasoning traces can interfere with guardrails, triggering false positives
- Threads not supported in streaming mode
- No automatic thread cleanup mechanism
Vendor Recommendation: NVIDIA states developers should "work with their internal application team to ensure guardrails meets [their] requirements" — tune for your use case.
Best For: Teams needing highly customizable, self-hosted guardrails with complex policy logic.
Not Recommended For: Simple use cases where managed services suffice; teams without Python/ML expertise.
License: Apache 2.0
Documentation: NeMo Guardrails Docs
GitHub: github.com/NVIDIA/NeMo-Guardrails
Guardrails AI¶
Overview: Open-source Python framework for adding structural and semantic validation to LLM outputs. Focus on output validation with a library of reusable validators.
How It Works:

- Define "guards" that validate LLM outputs
- Validator library (Guardrails Hub) with pre-built checks
- Supports structured output validation (JSON, etc.)
- Can retry ("reask") on validation failure
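The validate-then-reask loop that Guardrails AI automates can be illustrated with stdlib-only code. Note this is a generic sketch of the pattern, not the library's API: `validate_order` and its field names are hypothetical stand-ins for a real validator.

```python
import json

def validate_order(raw: str) -> tuple[bool, str]:
    """Check that raw output is JSON carrying the fields a consumer expects."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "Output was not valid JSON; respond with JSON only."
    missing = {"item", "quantity"} - data.keys()
    if missing:
        return False, f"Missing fields: {sorted(missing)}. Re-emit the full JSON."
    return True, ""

def generate_with_reask(llm, prompt: str, max_retries: int = 2) -> dict:
    """Call llm(prompt), validate the output, and re-ask with the error on failure."""
    for _ in range(max_retries + 1):
        raw = llm(prompt)
        ok, error = validate_order(raw)
        if ok:
            return json.loads(raw)
        # Feed the validation error back so the model can self-correct.
        prompt = f"{prompt}\n\nYour previous answer was invalid: {error}"
    raise ValueError("LLM output failed validation after retries")
```

Guardrails AI wraps this loop behind its `Guard` abstraction and supplies the validators; the sketch just shows the control flow being automated.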
Strengths:

- Validator ecosystem: large library of pre-built validators
- Structured output focus: strong at enforcing output format compliance
- Retry logic: automatic correction on validation failure
- Simple API: easy to integrate

Limitations:

- Output-focused: less comprehensive for input validation
- LLM dependency: many validators require LLM calls
- Limited multimodal: primarily text-focused
Best For: Applications requiring structured LLM outputs with validation; RAG pipelines needing output quality checks.
License: Apache 2.0
Documentation: guardrailsai.com
GitHub: github.com/guardrails-ai/guardrails
Safety Models¶
Meta Llama Guard¶
Overview: LLM-based input/output moderation model from Meta, fine-tuned for safety classification. Available in multiple versions (Llama Guard 1, 2, 3, 4) with evolving capabilities.
How It Works:

- A fine-tuned Llama model classifies content as safe or unsafe
- Outputs the category of violation when unsafe
- Instruction-tunable: can adapt to custom taxonomies via prompting
- Available in quantized versions for lower deployment cost
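A deployment sketch, assuming the `transformers` package and access to the gated `meta-llama/Llama-Guard-3-8B` weights; the parser reflects the model's documented verdict format of `safe`, or `unsafe` followed by a line of category codes such as `S1,S10`.

```python
def parse_llama_guard(output: str) -> tuple[bool, list[str]]:
    """Parse a Llama Guard text verdict into (is_safe, violated_categories)."""
    lines = [line.strip() for line in output.strip().splitlines() if line.strip()]
    if not lines or lines[0].lower() == "safe":
        return True, []
    categories = lines[1].split(",") if len(lines) > 1 else []
    return False, [c.strip() for c in categories]

def classify(conversation: list[dict]) -> tuple[bool, list[str]]:
    """Run a conversation through Llama Guard 3.

    Requires transformers, torch, and access to the gated model weights;
    the chat template bundled with the tokenizer builds the safety prompt.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-Guard-3-8B"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    ids = tok.apply_chat_template(conversation, return_tensors="pt")
    out = model.generate(ids, max_new_tokens=30, pad_token_id=tok.eos_token_id)
    verdict = tok.decode(out[0][ids.shape[-1]:], skip_special_tokens=True)
    return parse_llama_guard(verdict)
```

Because the verdict is free text from an LLM, production parsers should treat anything that is not an exact `safe` as unsafe-by-default, as the helper above does.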
Versions:

| Version | Base Model | Languages | Categories |
|---|---|---|---|
| Llama Guard 3 | Llama 3 | 8 languages | 14 (MLCommons taxonomy) |
| Llama Guard 4 | Llama 4 Scout (12B) | Multilingual | MLCommons + custom |
Strengths:

- Open weights: self-hostable and customizable
- Instruction-tunable: adapt to custom policies via prompting
- MLCommons aligned: standard taxonomy for interoperability
- Multilingual: Llama Guard 3+ supports 8 languages
- Tool use awareness: can detect code interpreter abuse

Limitations:

- English-optimized: performance varies in other languages
- Context sensitivity: may flag therapeutic discussions of self-harm
- Adversarial vulnerability: as an LLM, susceptible to prompt injection
- False positive rate: may increase refusals of benign prompts
- Attack bypass rate: independent testing shows roughly 33% of attacks bypass its protection

Known Issues:

- Llama Guard is itself an LLM and can be prompted to generate arbitrary text (not just classifications)
- Performance on custom taxonomies requires fine-tuning for best results
- Longer context windows can reduce guardrail effectiveness
Meta's Recommendation: "There is no one-size-fits-all guardrail detection to prevent all risks. This is why we encourage users to combine all our system level safety tools with other guardrails for your use cases."
Best For: Organizations wanting self-hosted safety classification with customization capability.
Not Recommended For: Production use without additional guardrail layers; non-English deployments without testing.
License: Llama Community License (requires "Built with Llama" attribution)
Documentation: Llama Protections
Models: Llama Guard 3 on HuggingFace
OpenAI Moderation API¶
Overview: OpenAI's content moderation endpoint for detecting harmful content in text.
How It Works:

- An API endpoint classifies text across harm categories
- Returns category flags and confidence scores
- Free to use for OpenAI API customers
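A minimal sketch using the official `openai` package; an API key is assumed, and `omni-moderation-latest` is the current moderation model name at the time of writing.

```python
def flagged_categories(result: dict) -> list[str]:
    """Extract the category names flagged in a moderation result dict."""
    cats = result.get("categories", {})
    return sorted(name for name, flagged in cats.items() if flagged)

def moderate(text: str) -> list[str]:
    """Call the Moderation endpoint and return the flagged category names.

    Requires the openai package and an OPENAI_API_KEY in the environment.
    """
    from openai import OpenAI
    client = OpenAI()
    resp = client.moderations.create(model="omni-moderation-latest", input=text)
    return flagged_categories(resp.results[0].model_dump())
```

A common pattern is to moderate the user input before the main completion call and the model's output after it, rejecting or redacting when any category comes back flagged.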
Strengths:

- Free: no additional cost for OpenAI customers
- Simple API: a single endpoint, easy to integrate
- Fast: low-latency classification

Limitations:

- OpenAI ecosystem only: designed for OpenAI models
- Text only: no multimodal support
- Limited customization: cannot adapt categories
- English-focused: performance varies in other languages
Best For: Quick content filtering for OpenAI-based applications.
Documentation: OpenAI Moderation
Evaluation Frameworks (LLM-as-Judge)¶
DeepEval / Confident AI¶
Overview: Open-source LLM evaluation framework providing pytest-like testing for LLM applications. Supports both development-time benchmarking and production monitoring.
How It Works:

- Define test cases with inputs, outputs, and expected behaviors
- Run metrics (G-Eval, hallucination, relevancy, etc.) against outputs
- LLM-as-judge approach for most metrics
- Integrates with CI/CD pipelines
- Confident AI cloud platform for collaboration and monitoring
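A minimal sketch of a DeepEval check, assuming the `deepeval` package and an API key for whichever judge model it is configured to use; the question, answer, and threshold are illustrative.

```python
def run_relevancy_check(question: str, answer: str, threshold: float = 0.7) -> None:
    """Judge answer relevancy with DeepEval's LLM-as-judge metric.

    Requires the deepeval package and credentials for the judge model;
    raises an assertion error when the metric score falls below threshold.
    """
    from deepeval import assert_test
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    test_case = LLMTestCase(input=question, actual_output=answer)
    metric = AnswerRelevancyMetric(threshold=threshold)
    assert_test(test_case, [metric])
```

Because the check behaves like a test assertion, it drops naturally into a pytest suite and therefore into CI, which is how the framework's CI/CD integration is typically used.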
Strengths:

- Comprehensive metrics: 50+ research-backed evaluation metrics
- Flexible: works with any LLM provider
- Production monitoring: async evals in production via Confident AI
- Red teaming: built-in adversarial testing for 40+ vulnerabilities
- Component-level evals: can evaluate individual pipeline components

Limitations:

- LLM dependency: most metrics require LLM calls (cost, latency)
- Rate limits: large evaluations can hit LLM provider limits
- Metric-outcome fit: metrics may not correlate with business outcomes without calibration

Known Issues:

- Rate-limit errors are common during large evaluations
- False positives/negatives require metric tuning
- Production evals need an async architecture to avoid blocking
Best For: Teams needing comprehensive LLM evaluation with CI/CD integration.
License: Apache 2.0 (open-source); Confident AI platform has free and paid tiers
Documentation: deepeval.com
GitHub: github.com/confident-ai/deepeval
Galileo¶
Overview: LLM evaluation platform with "eval-to-guardrail" lifecycle — evaluations developed in testing become production guardrails.
How It Works:

- Define evaluation criteria during development
- Test against datasets to calibrate
- Deploy the same evals as production guardrails
- Luna models provide low-cost monitoring

Strengths:

- Unified lifecycle: evals-to-guardrails workflow
- Low-cost monitoring: Luna models for production
- Observability: built-in tracing and debugging

Limitations:

- Platform dependency: requires the Galileo platform
- Proprietary: less flexibility than open-source options
Best For: Teams wanting integrated eval-to-production workflow.
Documentation: rungalileo.io
Standards and Guidance¶
OWASP LLM Top 10 (2025)¶
Overview: Industry-standard taxonomy of security risks for LLM applications, maintained by OWASP with 500+ contributors.
Categories (2025 edition):

1. LLM01: Prompt Injection
2. LLM02: Sensitive Information Disclosure
3. LLM03: Supply Chain
4. LLM04: Data and Model Poisoning
5. LLM05: Improper Output Handling
6. LLM06: Excessive Agency
7. LLM07: System Prompt Leakage
8. LLM08: Vector and Embedding Weaknesses
9. LLM09: Misinformation
10. LLM10: Unbounded Consumption
Also Available: OWASP Top 10 for Agentic Applications (December 2025)
Use For: Risk identification, security assessments, compliance documentation.
Documentation: owasp.org/www-project-top-10-for-large-language-model-applications
NIST AI Risk Management Framework¶
Overview: US government framework for AI risk management with four core functions: Govern, Map, Measure, Manage.
Use For: Enterprise AI governance, federal compliance, risk assessment structure.
Documentation: nist.gov/itl/ai-risk-management-framework
ISO 42001¶
Overview: International standard for AI management systems. Certifiable.
Use For: Formal AI governance certification, enterprise compliance.
Documentation: iso.org/standard/81230.html
Emerging Solutions¶
LlamaFirewall (Meta)¶
Security guardrail tool for building secure AI systems. Part of Meta's Llama Protections suite.
Prompt Guard (Meta)¶
Multi-label classifier for detecting prompt injections and jailbreaks. Available in 86M and 22M parameter versions.
CyberSecEval (Meta)¶
Benchmarks for measuring LLM cybersecurity risks and defensive capabilities.
Lasso Security Secure Gateway¶
Model-agnostic security gateway providing guardrails across any AI platform.
Solution Selection Guide¶
| If You Need... | Consider |
|---|---|
| Turnkey AWS integration | AWS Bedrock Guardrails |
| Turnkey Azure integration | Azure AI Content Safety |
| Highly customizable, self-hosted | NVIDIA NeMo Guardrails |
| Output validation focus | Guardrails AI |
| Self-hosted safety model | Llama Guard |
| LLM evaluation/testing | DeepEval |
| Integrated eval-to-production | Galileo |
| Risk taxonomy | OWASP LLM Top 10 |
| Governance framework | NIST AI RMF, ISO 42001 |
Common Limitations Across Solutions¶
| Limitation | Description | Mitigation |
|---|---|---|
| Language coverage | Most optimized for English | Test non-English thoroughly; consider translation layers |
| Novel attacks | Pattern-based detection misses new techniques | Combine with behavioral monitoring; update regularly |
| False positives | Over-blocking legitimate content | Tune thresholds; allow human override |
| Latency/cost | LLM-based evaluation adds overhead | Tier your evaluation; sample at scale |
| Context sensitivity | May misclassify domain-specific content | Custom fine-tuning; domain-specific rules |
| Adversarial vulnerability | LLM-based guards can be attacked | Defense in depth; multiple layers |
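The "tier your evaluation; sample at scale" mitigation can be made concrete with deterministic, hash-based sampling: only a fixed fraction of traffic pays the judge-layer cost, and the same request ID always routes the same way, so retries and replays are evaluated consistently. A minimal sketch:

```python
import hashlib

def sample_for_judge(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically route a fraction of traffic to the expensive judge layer.

    Hashing the request ID maps it to a stable bucket in [0, 1); requests
    whose bucket falls below the rate get the full LLM-as-judge evaluation,
    the rest get only the fast guardrail pass.
    """
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

The rate can be raised for new features or flagged user segments and lowered once metrics stabilize, without any change to the decision being non-deterministic.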
Implementation Recommendations¶
1. Layer your defenses: No single solution catches everything. Combine fast guardrails + LLM evaluation + human oversight.
2. Start simple: Begin with platform-native guardrails before custom solutions.
3. Test in your domain: Published benchmarks may not reflect your use case. Measure performance on your data.
4. Plan for false positives: Overly aggressive guardrails harm user experience. Build in human override paths.
5. Budget for evaluation: LLM-as-judge has real costs. Factor them into architecture decisions.
6. Update continuously: Attacks evolve. Guardrails need regular updates.
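The layering recommendation can be sketched as a short pipeline: cheap synchronous guards run first, a risk-scoring judge runs second, and borderline cases are escalated to human review. The guard, judge, and escalation callables below are placeholders you would wire to real components (regex filters, a safety model, a review queue).

```python
from typing import Callable

def layered_check(
    text: str,
    fast_guards: list[Callable[[str], bool]],
    judge: Callable[[str], float],
    escalate: Callable[[str], None],
    judge_threshold: float = 0.8,
) -> bool:
    """Return True when text passes all layers; flag borderline cases for humans.

    fast_guards return True when the text is clean; judge returns a risk
    score in [0, 1]; escalate queues the text for human review.
    """
    if not all(guard(text) for guard in fast_guards):
        return False  # cheap, synchronous rejection at the guardrails layer
    risk = judge(text)
    if risk >= judge_threshold:
        return False  # deeper evaluation rejected it
    if risk >= judge_threshold / 2:
        escalate(text)  # borderline: allow, but queue for human oversight
    return True
```

Keeping the judge behind the fast guards means most rejected traffic never pays the judge's latency or cost, which is the main point of the three-layer pattern.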
Credits and Acknowledgments¶
This guide synthesizes publicly available documentation, research, and community feedback. Credit to:
- NVIDIA — NeMo Guardrails and documentation
- Meta — Llama Guard, Prompt Guard, and Llama Protections ecosystem
- AWS — Bedrock Guardrails documentation and best practices
- Microsoft — Azure AI Content Safety transparency notes
- Confident AI — DeepEval framework and documentation
- OWASP — LLM Top 10 and community contributions
- NIST — AI Risk Management Framework
- The broader AI safety research community
AI Runtime Behaviour Security, 2026 (Jonathan Gill).