Current AI Safety Solutions¶

A reference guide to production-ready guardrail, evaluation, and safety solutions implementing the three-layer pattern.

Solutions At a Glance¶

Solution	Type	What It Does	Layer	Open Source	Key Limitation
AWS Bedrock Guardrails	Managed	Content filtering, PII detection, hallucination checks, denied topics	Guardrails	No	30 denied topic limit; cross-region IAM issues
Azure AI Content Safety	Managed	Harm classification (0-7 severity), prompt shields, groundedness	Guardrails	No	English-optimized; 10K char limit per request
NVIDIA NeMo Guardrails	Framework	Programmable rails (input/output/dialog/retrieval/execution)	Guardrails	Yes	Dialog rails don't work with reasoning models
Guardrails AI	Framework	Output validation, structured output enforcement, retry logic	Guardrails	Yes	Output-focused; less input validation
Llama Guard 3/4	Model	LLM-based content classification (safe/unsafe + category)	Guardrails/Judge	Yes	~33% attack bypass rate; English-optimized
OpenAI Moderation API	API	Harm classification across categories	Guardrails	No	OpenAI models only; limited customization
DeepEval	Framework	LLM-as-judge evaluation, 50+ metrics, CI/CD integration	Judge	Yes	LLM calls add cost/latency at scale
Galileo	Platform	Eval-to-guardrail lifecycle, Luna models for monitoring	Judge	No	Platform dependency
Prompt Guard (Meta)	Model	Prompt injection and jailbreak detection	Guardrails	Yes	Needs fine-tuning for best results
LlamaFirewall (Meta)	Tool	Security guardrail for AI systems	Guardrails	Yes	Early stage

Solutions by Use Case¶

If You Need...	Primary Choice	Alternative
Turnkey AWS guardrails	AWS Bedrock Guardrails	—
Turnkey Azure guardrails	Azure AI Content Safety	—
Self-hosted, customizable	NVIDIA NeMo Guardrails	Guardrails AI
Open-source safety model	Llama Guard 3/4	Prompt Guard
LLM evaluation/testing	DeepEval	Galileo
Production monitoring	Confident AI (DeepEval)	Galileo
Structured output validation	Guardrails AI	NeMo Guardrails
Multimodal content safety	Azure AI Content Safety	Llama Guard 4
Hallucination detection	AWS Bedrock (Automated Reasoning)	DeepEval metrics

Solutions by Layer¶

Guardrails Layer (Real-time, ~10-100ms)¶

Solution	Input	Output	Multimodal	Customizable	Self-Hosted
AWS Bedrock Guardrails	✓	✓	Images (preview)	Limited	No
Azure AI Content Safety	✓	✓	✓	Custom categories	No
NVIDIA NeMo Guardrails	✓	✓	Limited	Highly	Yes
Guardrails AI	Limited	✓	No	Highly	Yes
Llama Guard	✓	✓	Llama Guard 4	Via prompting	Yes
OpenAI Moderation	✓	✓	No	No	No

Judge Layer (Async, ~500ms-5s)¶

Solution	Metrics	Custom Criteria	Production Monitoring	CI/CD
DeepEval	50+	G-Eval, DAG	Via Confident AI	✓
Galileo	Multiple	✓	Built-in	✓
Custom LLM prompts	Unlimited	✓	DIY	DIY

Industry Context¶

The AI security industry has converged on a common pattern: layered runtime controls combining fast filtering (guardrails), deeper evaluation (LLM-as-judge), and human oversight. This guide catalogs the major solutions implementing this pattern, with honest assessments of capabilities, limitations, and appropriate use cases.

This page exists to give credit where it's due and help practitioners select appropriate tools. The Framework synthesizes and explains the pattern these solutions implement.

Quick Reference: Solution Categories¶

Category	Purpose	Examples
Platform Guardrails	Cloud-native filtering integrated with AI services	AWS Bedrock Guardrails, Azure AI Content Safety
Open-Source Frameworks	Self-hosted, customizable guardrail systems	NVIDIA NeMo Guardrails, Guardrails AI
Safety Models	LLM-based content moderation	Llama Guard, OpenAI Moderation API
Evaluation Frameworks	LLM-as-Judge implementation	DeepEval, Galileo
Standards & Guidance	Risk frameworks and taxonomies	OWASP LLM Top 10, NIST AI RMF

Platform Guardrails¶

AWS Bedrock Guardrails¶

Overview: Managed guardrail service integrated with Amazon Bedrock foundation models. Provides content filtering, PII detection, denied topics, and (uniquely) automated reasoning checks for hallucination detection.

How It Works: - Evaluates both user inputs and model responses against configured policies - Six safeguard types: content filters, denied topics, word filters, sensitive info, contextual grounding, automated reasoning - Can be used via API without invoking the model (ApplyGuardrail API) - Works with any model (Bedrock-hosted or external via API)

Strengths: - Automated Reasoning checks claim 99% accuracy for hallucination detection (AWS claim) - Blocks up to 88% of harmful content (AWS benchmark) - Native integration with Bedrock agents, knowledge bases, and flows - Cross-model consistency — same guardrails work across different FMs

Limitations: - Cross-region complexity: Known IAM permission issues when guardrails and agents are in different regions - Input tagging limitations: Not currently supported with managed prompts - Latency cost: Adds processing time; charges apply even when blocking input - 30 denied topic limit: May be insufficient for complex policy sets

Known Issues: - Access denied errors when using cross-region guardrails with Bedrock Agents (requires careful IAM configuration) - VPC endpoint limitations for cross-region access - Streaming not fully supported with all guardrail configurations

Best For: Organizations already using AWS Bedrock who want turnkey guardrails with minimal setup.

Not Recommended For: Complex multi-region deployments without careful IAM planning; use cases requiring more than 30 denied topics.

Pricing: Per 1,000 text units (1,000 characters each). Word filters free. See AWS Pricing.

Documentation: AWS Bedrock Guardrails

Azure AI Content Safety¶

Overview: Microsoft's content moderation service providing text and image analysis with severity scoring across harm categories.

How It Works: - Multi-class classification for hate, violence, sexual content, self-harm - Severity levels 0-7 for text, 0-3 for images - Prompt Shields for jailbreak and injection detection - Groundedness detection for hallucination - Protected material detection for copyright

Strengths: - Multimodal support (text, images, text+image) - Granular severity scoring (not just binary) - Custom categories API for domain-specific content - Integration with Azure OpenAI and Foundry - Protected material detection for copyright compliance

Limitations: - Language support: Optimized for English; performance varies for other languages (German, Japanese, Spanish, French, Italian, Portuguese, Chinese supported) - 10K character limit: Per submission for text moderation - Image recognition limits: May miss content in unclear or edited images - Cannot detect CSAM: Explicitly stated limitation - Evolving threats: May not keep pace with new attack techniques

Known Issues: - False positives reported in scientific/medical contexts (pharmaceutical companies report legitimate content being flagged) - Groundedness detection inconsistencies (some users report it returns empty results) - Content filter token costs can be significant (reported 10x other costs in some deployments)

Best For: Microsoft Azure customers needing content moderation with severity scoring and multimodal support.

Not Recommended For: Non-English content at scale; scientific/medical applications without custom configuration.

Pricing: Per text record (1,000 characters) and per image. See Azure Pricing.

Documentation: Azure AI Content Safety

Open-Source Frameworks¶

NVIDIA NeMo Guardrails¶

Overview: Open-source Python library for adding programmable guardrails to LLM applications. Highly customizable with support for multiple rail types and integration with major LLM providers.

How It Works: - Five rail types: input, dialog, retrieval, execution, output - Colang 2.0 DSL for defining conversational flows - Can orchestrate multiple rails with configurable execution order - Supports GPU acceleration for low-latency performance

Strengths: - Highly programmable: Colang DSL allows complex policy logic - Multi-rail orchestration: Coordinate input, dialog, retrieval, execution, and output rails - LLM provider agnostic: Works with OpenAI, Azure, Anthropic, HuggingFace, NIM - LangChain/LangGraph integration: Native support for popular frameworks - GPU acceleration: NVIDIA hardware optimization for performance

Limitations: - Learning curve: Colang DSL requires learning - LLM dependency: Most rails require an LLM for evaluation (adds latency/cost) - Dialog rails not supported with reasoning models: Documented limitation - Built-in rails may not suit production: NVIDIA explicitly states "may or may not be suitable for a given production use case"

Known Issues: - Jailbreak detection container setup issues reported (GitHub Issue #690) - Reasoning traces can interfere with guardrails, triggering false positives - Threads not supported in streaming mode - No automatic thread cleanup mechanism

Vendor Recommendation: NVIDIA states developers should "work with their internal application team to ensure guardrails meets [their] requirements" — tune for your use case.

Best For: Teams needing highly customizable, self-hosted guardrails with complex policy logic.

Not Recommended For: Simple use cases where managed services suffice; teams without Python/ML expertise.

License: Apache 2.0

Documentation: NeMo Guardrails Docs

GitHub: github.com/NVIDIA/NeMo-Guardrails

Guardrails AI¶

Overview: Open-source Python framework for adding structural and semantic validation to LLM outputs. Focus on output validation with a library of reusable validators.

How It Works: - Define "guards" that validate LLM outputs - Validator library (Guardrails Hub) with pre-built checks - Supports structured output validation (JSON, etc.) - Can retry/reask on validation failure

Strengths: - Validator ecosystem: Large library of pre-built validators - Structured output focus: Strong at ensuring output format compliance - Retry logic: Automatic correction on validation failure - Simple API: Easy to integrate

Limitations: - Output-focused: Less comprehensive for input validation - LLM dependency: Many validators require LLM calls - Limited multimodal: Primarily text-focused

Best For: Applications requiring structured LLM outputs with validation; RAG pipelines needing output quality checks.

License: Apache 2.0

Documentation: guardrailsai.com

GitHub: github.com/guardrails-ai/guardrails

Safety Models¶

Meta Llama Guard¶

Overview: LLM-based input/output moderation model from Meta, fine-tuned for safety classification. Available in multiple versions (Llama Guard 1, 2, 3, 4) with evolving capabilities.

How It Works: - Fine-tuned Llama model that classifies content as safe/unsafe - Outputs category of violation when unsafe - Instruction-tunable — can adapt to custom taxonomies via prompting - Available in quantized versions for lower deployment cost

Versions: | Version | Base Model | Languages | Categories | |---------|------------|-----------|------------| | Llama Guard 3 | Llama 3 | 8 languages | 14 (MLCommons taxonomy) | | Llama Guard 4 | Llama 4 Scout (12B) | Multilingual | MLCommons + custom |

Strengths: - Open weights: Self-hostable, customizable - Instruction-tunable: Adapt to custom policies via prompting - MLCommons aligned: Standard taxonomy for interoperability - Multilingual: Llama Guard 3+ supports 8 languages - Tool use awareness: Can detect code interpreter abuse

Limitations: - English-optimized: Performance varies in other languages - Context sensitivity: May flag therapeutic discussions of self-harm - Adversarial vulnerability: As an LLM, susceptible to prompt injection - False positive rate: May increase refusals to benign prompts - Attack bypass rate: Independent testing shows ~33% of attacks bypass protection

Known Issues: - Llama Guard is an LLM and can be prompted to generate any text (not just classifications) - Performance on custom taxonomies requires fine-tuning for best results - Longer context windows can reduce guardrail effectiveness

Meta's Recommendation: "There is no one-size-fits-all guardrail detection to prevent all risks. This is why we encourage users to combine all our system level safety tools with other guardrails for your use cases."

Best For: Organizations wanting self-hosted safety classification with customization capability.

Not Recommended For: Production use without additional guardrail layers; non-English deployments without testing.

License: Llama Community License (requires "Built with Llama" attribution)

Documentation: Llama Protections

Models: Llama Guard 3 on HuggingFace

OpenAI Moderation API¶

Overview: OpenAI's content moderation endpoint for detecting harmful content in text.

How It Works: - API endpoint that classifies text across harm categories - Returns category flags and confidence scores - Free to use for OpenAI API customers

Strengths: - Free: No additional cost for OpenAI customers - Simple API: Single endpoint, easy integration - Fast: Low latency classification

Limitations: - OpenAI ecosystem only: Designed for OpenAI models - Text only: No multimodal support - Limited customization: Cannot adapt categories - English-focused: Performance varies in other languages

Best For: Quick content filtering for OpenAI-based applications.

Documentation: OpenAI Moderation

Evaluation Frameworks (LLM-as-Judge)¶

DeepEval / Confident AI¶

Overview: Open-source LLM evaluation framework providing pytest-like testing for LLM applications. Supports both development-time benchmarking and production monitoring.

How It Works: - Define test cases with inputs, outputs, and expected behaviors - Run metrics (G-Eval, hallucination, relevancy, etc.) against outputs - LLM-as-judge approach for most metrics - Integrates with CI/CD pipelines - Confident AI cloud platform for collaboration and monitoring

Strengths: - Comprehensive metrics: 50+ research-backed evaluation metrics - Flexible: Works with any LLM provider - Production monitoring: Async evals in production via Confident AI - Red teaming: Built-in adversarial testing for 40+ vulnerabilities - Component-level evals: Can evaluate individual pipeline components

Limitations: - LLM dependency: Most metrics require LLM calls (cost, latency) - Rate limits: Can hit LLM provider limits during large evaluations - Metric-outcome fit: Metrics may not correlate with business outcomes without calibration

Known Issues: - Rate limit errors common during large evaluations - False positives/negatives require metric tuning - Production evals need async architecture to avoid blocking

Best For: Teams needing comprehensive LLM evaluation with CI/CD integration.

License: Apache 2.0 (open-source); Confident AI platform has free and paid tiers

Documentation: deepeval.com

GitHub: github.com/confident-ai/deepeval

Galileo¶

Overview: LLM evaluation platform with "eval-to-guardrail" lifecycle — evaluations developed in testing become production guardrails.

How It Works: - Define evaluation criteria during development - Test against datasets to calibrate - Deploy same evals as production guardrails - Luna models provide low-cost monitoring

Strengths: - Unified lifecycle: Evals → Guardrails workflow - Low-cost monitoring: Luna models for production - Observability: Built-in tracing and debugging

Limitations: - Platform dependency: Requires Galileo platform - Proprietary: Less flexibility than open-source options

Best For: Teams wanting integrated eval-to-production workflow.

Documentation: rungalileo.io

Standards and Guidance¶

OWASP LLM Top 10 (2025)¶

Overview: Industry-standard taxonomy of security risks for LLM applications, maintained by OWASP with 500+ contributors.

Categories: 1. Prompt Injection 2. Insecure Output Handling 3. Training Data Poisoning 4. Model Denial of Service 5. Supply Chain Vulnerabilities 6. Sensitive Information Disclosure 7. Insecure Plugin Design 8. Excessive Agency 9. Overreliance 10. Model Theft

Also Available: OWASP Top 10 for Agentic Applications (December 2025)

Use For: Risk identification, security assessments, compliance documentation.

Documentation: owasp.org/www-project-top-10-for-large-language-model-applications

NIST AI Risk Management Framework¶

Overview: US government framework for AI risk management with four core functions: Govern, Map, Measure, Manage.

Use For: Enterprise AI governance, federal compliance, risk assessment structure.

Documentation: nist.gov/itl/ai-risk-management-framework

ISO 42001¶

Overview: International standard for AI management systems. Certifiable.

Use For: Formal AI governance certification, enterprise compliance.

Documentation: iso.org/standard/81230.html

Emerging Solutions¶

LlamaFirewall (Meta)¶

Security guardrail tool for building secure AI systems. Part of Meta's Llama Protections suite.

Prompt Guard (Meta)¶

Multi-label classifier for detecting prompt injections and jailbreaks. Available in 86M and 22M parameter versions.

CyberSecEval (Meta)¶

Benchmarks for measuring LLM cybersecurity risks and defensive capabilities.

Lasso Security Secure Gateway¶

Model-agnostic security gateway providing guardrails across any AI platform.

Solution Selection Guide¶

If You Need...	Consider
Turnkey AWS integration	AWS Bedrock Guardrails
Turnkey Azure integration	Azure AI Content Safety
Highly customizable, self-hosted	NVIDIA NeMo Guardrails
Output validation focus	Guardrails AI
Self-hosted safety model	Llama Guard
LLM evaluation/testing	DeepEval
Integrated eval-to-production	Galileo
Risk taxonomy	OWASP LLM Top 10
Governance framework	NIST AI RMF, ISO 42001

Common Limitations Across Solutions¶

Limitation	Description	Mitigation
Language coverage	Most optimized for English	Test non-English thoroughly; consider translation layers
Novel attacks	Pattern-based detection misses new techniques	Combine with behavioral monitoring; update regularly
False positives	Over-blocking legitimate content	Tune thresholds; allow human override
Latency/cost	LLM-based evaluation adds overhead	Tier your evaluation; sample at scale
Context sensitivity	May misclassify domain-specific content	Custom fine-tuning; domain-specific rules
Adversarial vulnerability	LLM-based guards can be attacked	Defense in depth; multiple layers

Implementation Recommendations¶

Layer your defenses: No single solution catches everything. Combine fast guardrails + LLM evaluation + human oversight.
Start simple: Begin with platform-native guardrails before custom solutions.
Test in your domain: Published benchmarks may not reflect your use case. Measure performance on your data.
Plan for false positives: Overly aggressive guardrails harm user experience. Build in human override paths.
Budget for evaluation: LLM-as-judge has real costs. Factor into architecture decisions.
Update continuously: Attacks evolve. Guardrails need regular updates.

Credits and Acknowledgments¶

This guide synthesizes publicly available documentation, research, and community feedback. Credit to:

NVIDIA — NeMo Guardrails and documentation
Meta — Llama Guard, Prompt Guard, and Llama Protections ecosystem
AWS — Bedrock Guardrails documentation and best practices
Microsoft — Azure AI Content Safety transparency notes
Confident AI — DeepEval framework and documentation
OWASP — LLM Top 10 and community contributions
NIST — AI Risk Management Framework
The broader AI safety research community

AI Runtime Behaviour Security, 2026 (Jonathan Gill).