Controls: Guardrails, Judge, and Human Oversight¶
1. Guardrails¶
Real-time controls that block known-bad inputs and outputs.
Input Guardrails¶
| Control | What It Catches |
|---|---|
| Injection detection | Attempts to override system prompt |
| Encoding detection | Obfuscated attacks (Base64, hex, Unicode) |
| PII detection | Personal data in prompts |
| Content policy | Prohibited request types |
| Rate limiting | Abuse, enumeration |
| Length limits | Context stuffing |
Processing flow:
Input → Decode → Normalise → Pattern Match → ML Classify → Pass/Block
Output Guardrails¶
| Control | What It Catches |
|---|---|
| Content filtering | Harmful/inappropriate content |
| PII detection | Personal data leakage |
| Grounding check | Hallucination |
| Format validation | Malformed responses |
Limitations¶
Guardrails catch known patterns. They miss: - Novel techniques - Semantic variations - Context-dependent violations - Subtle policy violations
This is why the Judge provides the second layer.
For practical implementation guidance — international PII detection, RAG ingestion filtering, secrets scanning, alerting design, and guardrail exception governance — see Practical Guardrails.
2. LLM-as-Judge¶
Async evaluation of interactions for quality and policy compliance.
→ For model selection guidance, see Judge Model Selection
What the Judge Does¶
| Function | Description |
|---|---|
| Policy compliance | Did the AI follow guidelines? |
| Quality assessment | Accurate, helpful, appropriate? |
| Anomaly detection | Unusual patterns? |
| Risk flagging | What needs human review? |
What the Judge Does NOT Do¶
- Block transactions in real-time
- Make final decisions
- Replace human judgment
The Judge surfaces findings. Humans decide actions.
Architecture¶
Simple (low volume):
Interactions → Judge → Findings → HITL queue
Two-tier (high volume):
Interactions → Tier 1 (fast/cheap) → Flags only → Tier 2 (thorough) → HITL
Evaluation Criteria¶
| Criterion | Scoring |
|---|---|
| Policy adherence | Pass / Minor / Major violation |
| Accuracy | Verified / Unverified / Incorrect |
| Appropriateness | Appropriate / Borderline / Inappropriate |
| Safety | Safe / Uncertain / Concerning |
Output: PASS / REVIEW / ESCALATE
Deployment Phases¶
| Phase | Action on Findings |
|---|---|
| Shadow | Log only, measure accuracy |
| Advisory | Surface to humans, learn from feedback |
| Operational | Findings drive workflows |
Start in shadow mode. Validate accuracy before acting.
Accuracy¶
The Judge will make mistakes.
| Error | Impact | Mitigation |
|---|---|---|
| False positive | Unnecessary review | Tune prompts |
| False negative | Missed violations | Human sampling |
Target: >90% agreement with human reviewers.
3. Human Oversight (HITL)¶
Humans review findings, make decisions, remain accountable.
Triggers¶
| Trigger | Response |
|---|---|
| Judge flag | Review interaction |
| Guardrail block | Review if legitimate |
| User escalation | Human takes over |
| Sampling | Quality assurance |
| Threshold breach | Investigate pattern |
Queue Design¶
| Queue | SLA | Reviewer |
|---|---|---|
| Critical | 1h | Senior + expert |
| High | 4h | Domain expert |
| Standard | 24h | Trained reviewer |
| Sampling | 72h | QA team |
Actions¶
| Action | When |
|---|---|
| Approve | Interaction appropriate |
| Correct | Minor issue, fixable |
| Escalate | Needs senior review |
| Block user | Abuse detected |
| Tune | False positive |
Prevent Rubber-Stamping¶
| Control | Purpose |
|---|---|
| Canary cases | Verify reviewers catch known-bad |
| Time tracking | Flag too-fast reviews |
| Volume limits | Prevent fatigue |
| Inter-rater checks | Measure consistency |
Going Deeper¶
| Topic | Document |
|---|---|
| What these controls cost in production | Cost & Latency — latency budgets, sampling strategies, tiered evaluation cascade |
| Judge accuracy, drift, and adversarial failure | Judge Assurance · When the Judge Can Be Fooled |
| Practical guardrail configurations | Practical Guardrails — what to turn on first, encoding detection, international PII |
| Controls for multi-agent systems | MASO Framework — 93 controls across 6 domains for agent orchestration |
| Controls for reasoning models (o1, etc.) | Reasoning Model Controls — trace scanning, instruction adherence, consistency checks |
Implementation Order¶
- Logging — Can't evaluate what you don't capture
- Basic guardrails — Block obvious attacks
- Judge in shadow — Evaluate without action
- HITL queues — Somewhere for findings
- Judge advisory — Surface to humans
- Enhanced guardrails — Add ML detection
- Judge operational — Drive workflows
- Continuous tuning — Improve from findings
AI Runtime Behaviour Security, 2026 (Jonathan Gill).