AI Engineers
ML Engineers, AI Developers, Data Scientists, Platform Engineers — implementation patterns, not governance theory.
The Problem You Have
You're building AI systems. Your security and risk teams have requirements that sound like governance bureaucracy. You've been asked for "guardrails," "a Judge," "human oversight," and "PACE resilience," but what you actually need are answers to practical questions:
- What do I implement? Concrete patterns, not abstract principles.
- Where do I put it? Architecture-level placement in the pipeline.
- How do I test it? Verification that controls actually work.
- What breaks if I get it wrong? Failure modes you need to handle.
- What already exists? Libraries, services, and platform features you can use instead of building from scratch.
What This Framework Gives You
The three things you're building
Every AI system needs some combination of these. Your risk tier determines how much:
1. Guardrails — input/output filters that run on every request.
What you're implementing:
- Input: injection detection, content policy check, PII redaction, schema validation
- Output: hallucination check (ground against source), PII scan, toxicity filter, format validation
- Latency budget: ~10-20ms total
- Libraries: NVIDIA NeMo Guardrails, Guardrails AI, LangChain output parsers, AWS Bedrock Guardrails, Azure AI Content Safety
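As a rough illustration of the input side, the sketch below chains three cheap checks: schema validation, injection detection, and PII redaction. Everything in it is an assumption made for illustration: the names `check_input` and `GuardrailResult`, the regex patterns, and the 4000-character limit are hypothetical, and the regexes are a crude stand-in for the detectors that the libraries listed above provide.

```python
import re
from dataclasses import dataclass, field

# Illustrative patterns only (assumption); real injection detection
# needs a maintained detector, not two regexes.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?system prompt", re.IGNORECASE),
]
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass
class GuardrailResult:
    allowed: bool
    text: str  # possibly redacted text to pass downstream
    reasons: list = field(default_factory=list)


def check_input(text: str, max_chars: int = 4000) -> GuardrailResult:
    """Run cheap input checks in order: schema, injection, PII redaction."""
    # Schema validation: reject empty or oversized input.
    if not text.strip() or len(text) > max_chars:
        return GuardrailResult(False, "", ["schema: empty or too long"])
    # Injection detection: block on any known pattern.
    for pattern in INJECTION_PATTERNS:
        if pattern.search(text):
            return GuardrailResult(False, "", ["injection: matched known pattern"])
    # PII redaction: replace rather than block.
    reasons = []
    redacted, count = EMAIL_PATTERN.subn("[EMAIL]", text)
    if count:
        reasons.append(f"pii: redacted {count} email(s)")
    return GuardrailResult(True, redacted, reasons)
```

Blocking on injection but redacting on PII is a design choice in this sketch, not a framework requirement: a request containing an email address is usually still legitimate, while a matched injection pattern is not.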
2. LLM-as-Judge — an independent LLM that evaluates your task agent's output.
What you're implementing:
- A separate model (different from your task agent) that receives the input, output, and context
- A structured evaluation prompt that checks policy compliance, factual grounding, safety, quality
- A scoring/classification response (pass/fail/escalate with confidence)
- Routing logic: pass → deliver, fail → block, low-confidence → human review queue
Key constraint: the Judge must use a different model than your task agent. Same-model evaluation has correlated failure modes. If GPT-4 hallucinates a fact, GPT-4 evaluating that fact has a higher chance of missing it than Claude evaluating it, and vice versa.
- Implementation guide: LLM-as-Judge Implementation
- Prompt examples: Judge Prompt Examples
- Model selection: Judge Model Selection
- Calibration: Judge Assurance
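The routing logic and the different-model constraint described above can be sketched in a few lines. This is a minimal illustration, not the framework's reference implementation: `Verdict`, `JudgeResult`, `route`, and `assert_independent` are hypothetical names, and the 0.8 confidence threshold is an arbitrary placeholder that should come from calibration.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    ESCALATE = "escalate"


@dataclass
class JudgeResult:
    verdict: Verdict
    confidence: float  # 0.0 to 1.0, as reported by the Judge


def route(result: JudgeResult, min_confidence: float = 0.8) -> str:
    """Map a Judge result to an action: deliver, block, or human_review."""
    # Low confidence goes to a human regardless of the verdict.
    if result.verdict is Verdict.ESCALATE or result.confidence < min_confidence:
        return "human_review"
    if result.verdict is Verdict.PASS:
        return "deliver"
    return "block"


def assert_independent(task_model: str, judge_model: str) -> None:
    """Fail fast at startup if the Judge shares a model with the task agent."""
    if task_model == judge_model:
        raise ValueError("Judge must use a different model than the task agent")
```

Checking model independence once at startup, rather than per request, keeps the constraint out of the hot path while still making a misconfiguration impossible to miss.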
3. Circuit breaker / PACE fail postures — what your system does when control layers fail.
What you're implementing:
- Health checks on guardrail and Judge services
- Fallback routing when each layer is unavailable
- A kill switch that removes the AI from the path entirely
- Pre-defined degradation: full service → limited scope → human-only → static fallback
This is infrastructure code, not AI code. Treat it like any service reliability pattern.
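One way to express the degradation ladder is a small health-driven selector. The sketch below is a minimal illustration under stated assumptions: the mapping from failures to postures (Judge down means limited scope, guardrails down means human-only, kill switch means static fallback) is one plausible choice rather than something the framework prescribes, and `Posture`, `CircuitBreaker`, and `select_posture` are hypothetical names.

```python
from enum import IntEnum


class Posture(IntEnum):
    FULL_SERVICE = 0
    LIMITED_SCOPE = 1
    HUMAN_ONLY = 2
    STATIC_FALLBACK = 3


class CircuitBreaker:
    """Opens after N consecutive failed health checks; closes on the next success."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, healthy: bool) -> None:
        self.consecutive_failures = 0 if healthy else self.consecutive_failures + 1

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold


def select_posture(guardrails_healthy: bool, judge_healthy: bool,
                   kill_switch: bool = False) -> Posture:
    """Pick the fail posture for the current health state (mapping is an assumption)."""
    if kill_switch:
        return Posture.STATIC_FALLBACK  # AI removed from the path entirely
    if not guardrails_healthy:
        return Posture.HUMAN_ONLY       # never serve AI output without guardrails
    if not judge_healthy:
        return Posture.LIMITED_SCOPE    # guardrails only, so reduce what the AI may do
    return Posture.FULL_SERVICE
```

In use, one breaker per control layer would feed the selector, for example `select_posture(not guardrail_breaker.is_open, not judge_breaker.is_open)`.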
Implementation by risk tier
| Tier | What You Build | Judge Configuration | Human Oversight |
|---|---|---|---|
| LOW | Basic input/output guardrails | Optional — 1-5% sampling for monitoring | None (exception-based) |
| MEDIUM | Standard guardrails + Judge integration | 5-10% sampling, batch evaluation | Review flagged items only |
| HIGH | Full guardrail suite + Judge + routing | 20-50% coverage, near real-time | Flagged items + random sampling |
| CRITICAL | Hardened guardrails + Judge + human gate | 100% coverage, synchronous (blocks delivery) | All high-impact decisions reviewed |
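The Judge coverage column can be turned into a sampling decision per request. The sketch below assumes each request carries a stable ID and picks one rate from inside each range in the table (the specific values are placeholders, not recommendations). `JUDGE_SAMPLE_RATE` and `should_judge` are hypothetical names.

```python
import hashlib

# One value chosen from each range in the table above (assumption).
JUDGE_SAMPLE_RATE = {"LOW": 0.05, "MEDIUM": 0.10, "HIGH": 0.50, "CRITICAL": 1.0}

# Only CRITICAL runs the Judge synchronously, blocking delivery.
JUDGE_BLOCKS_DELIVERY = {"CRITICAL"}


def should_judge(tier: str, request_id: str) -> bool:
    """Deterministically sample a request for Judge evaluation by hashing its ID."""
    rate = JUDGE_SAMPLE_RATE[tier]
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Compare the first 8 bytes, read as an integer, against rate * 2**64.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)
```

Hashing the request ID instead of calling a random number generator makes the decision reproducible: the same request is always in or out of the sample, which helps when replaying traffic or debugging a missed detection.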
Platform-specific patterns
If you're building on a specific platform, these map framework controls to platform services:
| Platform | Pattern Guide | Key Services |
|---|---|---|
| AWS Bedrock | AWS Bedrock Patterns | Bedrock Guardrails, CloudWatch, IAM |
| Azure AI | Azure AI Patterns | Azure AI Content Safety, Responsible AI toolkit |
| Databricks | Databricks Patterns | MLflow, Unity Catalog, Model Serving |
| LangChain / LangGraph | Integration Guide | LangSmith, callbacks, output parsers |
Testing your controls
Controls that aren't tested don't work. The framework provides:
- Testing Guidance — structured test scenarios for each control layer
- Red Team Playbook — 13 adversarial scenarios (prompt injection, data exfiltration, privilege escalation, consensus manipulation)
- Judge Assurance — how to measure Judge accuracy, calibrate confidence thresholds, detect drift
- When the Judge Can Be Fooled — failure modes specific to the evaluation layer
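Measuring Judge accuracy usually starts with a human-labelled sample compared against the Judge's verdicts. The sketch below is a minimal illustration of that comparison, not the method in Judge Assurance; `judge_metrics` is a hypothetical helper, and it assumes both inputs are parallel lists of booleans where `True` means "should be blocked".

```python
def judge_metrics(labels, verdicts):
    """Compare Judge verdicts with human labels; True means 'should be blocked'."""
    if len(labels) != len(verdicts):
        raise ValueError("labels and verdicts must have the same length")
    pairs = list(zip(labels, verdicts))
    tp = sum(1 for label, verdict in pairs if label and verdict)
    fp = sum(1 for label, verdict in pairs if not label and verdict)
    fn = sum(1 for label, verdict in pairs if label and not verdict)
    # Precision: of everything the Judge blocked, how much deserved it.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything that deserved blocking, how much the Judge caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall, "false_negatives": fn}
```

False negatives are reported as a raw count because they are the cases to read individually: each one is a harmful output the Judge let through. Re-running the same labelled set on a schedule and watching these numbers move is one simple way to notice drift.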
Your Starting Path
| # | Document | Why You Need It |
|---|---|---|
| 1 | Controls | Three-layer implementation reference — what to build |
| 2 | Quick Start | Zero to working controls in 30 minutes |
| 3 | LLM-as-Judge Implementation | Judge layer patterns, prompts, routing logic |
| 4 | Judge Assurance | How to measure and calibrate Judge accuracy |
| 5 | Checklist | Track what you've implemented |
If you're building agents: Agentic Controls — tool scoping, action classification, confirmation gates.
If you're building multi-agent systems: MASO Integration Guide — message bus signing, per-agent identity, cross-agent DLP.
If you're building RAG: RAG Security — the attack surface you probably haven't considered.
What You Can Do Monday Morning
- Add input guardrails. If you have no controls today, start with injection detection on input. Use NVIDIA NeMo Guardrails, Guardrails AI, or your platform's built-in content safety service. This alone catches ~90% of known-pattern attacks.
- Add output grounding. If your system uses RAG, validate that the response is actually grounded in the retrieved documents. This catches hallucinated facts before they reach users.
- Implement a Judge on 10% of traffic. Pick a different model from your task agent. Use the Judge Prompt Examples as starting points. Log results. Measure the catch rate. This tells you what your guardrails are missing.
- Wire a circuit breaker. If your guardrail service goes down, your system should degrade to a safe state rather than continue without protection. A simple health check and fallback route take an afternoon.
- Red team your own system. Spend an hour trying to break it. The Red Team Playbook has structured scenarios. Document what you find. This is the most effective way to identify control gaps.
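For the output-grounding step in the list above, a first pass can be as simple as flagging response sentences that share few words with the retrieved documents. The sketch below is a deliberately crude lexical-overlap heuristic, assumed here only for illustration: production grounding checks typically use an entailment model or the Judge itself. `ungrounded_sentences` and the 0.6 overlap threshold are hypothetical.

```python
import re


def _tokens(text: str) -> set:
    """Lowercase alphanumeric tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def ungrounded_sentences(response: str, documents: list, min_overlap: float = 0.6) -> list:
    """Return response sentences whose word overlap with the documents is too low."""
    doc_tokens = set().union(*(_tokens(doc) for doc in documents))
    flagged = []
    # Split on whitespace that follows sentence-ending punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = _tokens(sentence)
        if not words:
            continue
        overlap = len(words & doc_tokens) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged
```

A heuristic like this misses paraphrased hallucinations and flags legitimate rewording, so it is best treated as a cheap signal that decides what to send to the Judge, not as a verdict on its own.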
Common Objections: With Answers
"The Judge adds latency to every request." Only for CRITICAL tier. For HIGH tier, run it asynchronously — it doesn't block the response. For MEDIUM tier, run it on a sample. For LOW tier, it's optional. See Cost & Latency.
"Our model is already aligned / fine-tuned / safe." Model alignment is necessary but insufficient. Alignment reduces the base rate of harmful outputs but doesn't eliminate it. Prompt injection bypasses alignment. RAG poisoning bypasses alignment. Edge cases that weren't in the training data bypass alignment. Runtime controls catch what alignment misses.
"We don't have budget for a second model (the Judge)." The Judge doesn't have to be expensive. A smaller, faster model (Haiku-class) running a focused evaluation prompt often outperforms a larger model for specific policy checks. Sample at 10% to start. The Judge Model Selection guide covers cost-effective configurations.
"Human oversight doesn't scale." Correct — which is why the framework doesn't require human review of every transaction (except at CRITICAL tier). The Judge handles scale. Humans handle the edge cases the Judge flags and the random samples that keep the system honest. See Humans Remain Accountable.
"This is security's job, not mine." Security sets the requirements. You implement them. The framework gives you concrete patterns so you're not guessing. The Controls document tells you exactly what to build. The Checklist tracks your progress. Security reviews the result, not the implementation approach.
AI Runtime Behaviour Security, 2026 (Jonathan Gill).