Controls: Guardrails, Judge, and Human Oversight

1. Guardrails

Real-time controls that block known-bad inputs and outputs.

Input Guardrails

Control	What It Catches
Injection detection	Attempts to override system prompt
Encoding detection	Obfuscated attacks (Base64, hex, Unicode)
PII detection	Personal data in prompts
Content policy	Prohibited request types
Rate limiting	Abuse, enumeration
Length limits	Context stuffing
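The last two rows can be approximated in a few lines. The sketch below is illustrative only: `SlidingWindowLimiter`, `within_length`, and the default limits are hypothetical names and values, not taken from any specific product.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per client within a rolling time window."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client_id -> timestamps of allowed requests

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def within_length(prompt: str, max_chars: int = 8000) -> bool:
    """Reject oversized prompts before they reach the model (limit is an assumed value)."""
    return len(prompt) <= max_chars
```

Passing `now` explicitly makes the limiter easy to test; in production the monotonic clock default would normally be used.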

Processing flow:

Input → Decode → Normalise → Pattern Match → ML Classify → Pass/Block
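A minimal sketch of that flow, assuming a tiny illustrative pattern list and a stubbed classifier. All function names, patterns, and the 0.8 threshold are hypothetical; a real deployment would use a maintained rule set and a trained model.

```python
import base64
import binascii
import re
import unicodedata

# Illustrative patterns only, not a complete rule set.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions"),
    re.compile(r"reveal (your |the )?system prompt"),
]
BASE64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decode_layers(text: str) -> str:
    """Decode: append the plaintext of Base64-looking tokens so later stages see both forms."""
    decoded_parts = []
    for token in BASE64_TOKEN.findall(text):
        try:
            decoded_parts.append(base64.b64decode(token, validate=True).decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # Not valid Base64, or not text: leave the token as-is.
    return " ".join([text, *decoded_parts])


def normalise(text: str) -> str:
    """Normalise: fold Unicode look-alikes (NFKC), lowercase, collapse whitespace."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", folded).strip()


def ml_classify(text: str) -> float:
    """ML Classify: placeholder for a trained classifier returning a risk score in [0, 1]."""
    return 0.0


def check_input(text: str, threshold: float = 0.8) -> tuple[bool, str]:
    """Run the stages in order and return (allowed, reason)."""
    candidate = normalise(decode_layers(text))
    for pattern in INJECTION_PATTERNS:  # Pattern Match
        if pattern.search(candidate):
            return False, f"pattern:{pattern.pattern}"
    if ml_classify(candidate) >= threshold:
        return False, "ml_classifier"
    return True, "pass"
```

Decoding and normalising before matching is the point of the ordering: the same pattern then catches the plain, Base64-wrapped, and full-width Unicode variants of an attack string.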

Output Guardrails

Control	What It Catches
Content filtering	Harmful/inappropriate content
PII detection	Personal data leakage
Grounding check	Hallucination
Format validation	Malformed responses
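Three of these checks can be sketched with the standard library alone. Everything here is an assumption for illustration: the e-mail regex stands in for a real PII detector, the required-keys check stands in for a schema validator, and the lexical overlap score is a deliberately crude proxy for a grounding check.

```python
import json
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_pii(text: str) -> tuple[str, int]:
    """Replace e-mail addresses and report how many were found."""
    return EMAIL.subn("[REDACTED_EMAIL]", text)


def validate_format(raw: str, required_keys: set[str]) -> bool:
    """Accept only a JSON object that contains every required key."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and required_keys.issubset(payload)


def grounding_score(answer: str, sources: list[str]) -> float:
    """Share of distinct answer words that also appear in the retrieved sources."""
    answer_words = set(re.findall(r"[a-z0-9]+", answer.lower()))
    source_words = set(re.findall(r"[a-z0-9]+", " ".join(sources).lower()))
    if not answer_words:
        return 1.0  # Nothing asserted, so nothing ungrounded.
    return len(answer_words & source_words) / len(answer_words)
```

A low `grounding_score` is only a signal that the answer strays from its sources; it cannot confirm that a high-overlap answer is correct.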

Limitations

Guardrails catch known patterns. They miss:

- Novel techniques
- Semantic variations
- Context-dependent violations
- Subtle policy violations

This is why the Judge provides the second layer.

For practical implementation guidance — international PII detection, RAG ingestion filtering, secrets scanning, alerting design, and guardrail exception governance — see Practical Guardrails.

2. LLM-as-Judge

Async evaluation of interactions for quality and policy compliance.

→ For model selection guidance, see Judge Model Selection

What the Judge Does

Function	Description
Policy compliance	Did the AI follow guidelines?
Quality assessment	Accurate, helpful, appropriate?
Anomaly detection	Unusual patterns?
Risk flagging	What needs human review?

What the Judge Does NOT Do

- Block transactions in real time
- Make final decisions
- Replace human judgment

The Judge surfaces findings. Humans decide actions.

Architecture

Simple (low volume):

Interactions → Judge → Findings → HITL queue

Two-tier (high volume):

Interactions → Tier 1 (fast/cheap) → Flags only → Tier 2 (thorough) → HITL
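The two-tier path can be sketched as a cascade in which the expensive judge only sees what the cheap one flags. The `Finding` type, the function names, and the stand-in judges below are hypothetical; in practice both tiers would call models.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Finding:
    interaction_id: str
    verdict: str  # "PASS", "REVIEW" or "ESCALATE"
    tier: int


def two_tier_judge(
    interactions: dict[str, str],
    tier1_flags: Callable[[str], bool],
    tier2_verdict: Callable[[str], str],
) -> list[Finding]:
    """Return the findings that should land in the HITL queue."""
    hitl_queue = []
    for interaction_id, transcript in interactions.items():
        if not tier1_flags(transcript):
            continue  # Tier 1 saw nothing suspicious, so no Tier 2 cost is incurred.
        verdict = tier2_verdict(transcript)
        if verdict != "PASS":
            hitl_queue.append(Finding(interaction_id, verdict, tier=2))
    return hitl_queue


# Stand-in judges for demonstration only.
def cheap_flag(transcript: str) -> bool:
    return "password" in transcript.lower()


def thorough(transcript: str) -> str:
    return "ESCALATE" if "here is the password" in transcript.lower() else "PASS"
```

The trade-off is that anything Tier 1 misses never reaches Tier 2, which is one reason the human sampling described later still matters.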

Evaluation Criteria

Criterion	Scoring
Policy adherence	Pass / Minor / Major violation
Accuracy	Verified / Unverified / Incorrect
Appropriateness	Appropriate / Borderline / Inappropriate
Safety	Safe / Uncertain / Concerning

Output: PASS / REVIEW / ESCALATE
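The source does not say how the four criteria combine into one output. One plausible rule, assumed here purely for illustration, is "worst score wins": any middle-level score yields REVIEW and any top-level score yields ESCALATE. The criterion keys and `overall_verdict` are hypothetical names.

```python
# Severity 0 = fine, 1 = needs a look, 2 = serious. Labels follow the table above.
SEVERITY = {
    "policy": {"Pass": 0, "Minor": 1, "Major": 2},
    "accuracy": {"Verified": 0, "Unverified": 1, "Incorrect": 2},
    "appropriateness": {"Appropriate": 0, "Borderline": 1, "Inappropriate": 2},
    "safety": {"Safe": 0, "Uncertain": 1, "Concerning": 2},
}
VERDICTS = ["PASS", "REVIEW", "ESCALATE"]


def overall_verdict(scores: dict[str, str]) -> str:
    """Map per-criterion labels to a single verdict using the worst score (assumed rule)."""
    missing = SEVERITY.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    worst = max(SEVERITY[criterion][scores[criterion]] for criterion in SEVERITY)
    return VERDICTS[worst]
```

Other aggregation rules, such as weighting safety more heavily than accuracy, are equally defensible; the point is that the rule should be explicit and testable.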

Deployment Phases

Phase	Action on Findings
Shadow	Log only, measure accuracy
Advisory	Surface to humans, learn from feedback
Operational	Findings drive workflows

Start in shadow mode. Validate accuracy before acting.
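A sketch of how the phase might gate what happens to a finding. The `Phase` enum, the queue arguments, and the choice to route only ESCALATE verdicts into workflows are assumptions, not part of the source.

```python
from enum import Enum


class Phase(Enum):
    SHADOW = "shadow"
    ADVISORY = "advisory"
    OPERATIONAL = "operational"


def handle_finding(phase: Phase, finding: dict, log: list, review_queue: list, workflow_queue: list) -> None:
    """Apply the phase's action to a single Judge finding."""
    log.append(finding)  # Every phase logs, so accuracy can always be measured.
    if phase is Phase.SHADOW:
        return  # Shadow mode: log only, no human or workflow impact.
    if finding["verdict"] != "PASS":
        review_queue.append(finding)  # Advisory and above: surface to humans.
    if phase is Phase.OPERATIONAL and finding["verdict"] == "ESCALATE":
        workflow_queue.append(finding)  # Operational: findings also drive workflows.
```

Because the phase is a single parameter, moving from shadow to advisory is a configuration change rather than a code change, which makes rollback cheap if accuracy turns out to be poor.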

Accuracy

The Judge will make mistakes.

Error	Impact	Mitigation
False positive	Unnecessary review	Tune prompts
False negative	Missed violations	Human sampling

Target: >90% agreement with human reviewers.
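Agreement can be measured on any sample that both the Judge and a human have labelled. A minimal sketch, assuming each label is a boolean "flagged" decision; `judge_accuracy` and its output keys are hypothetical names.

```python
def judge_accuracy(judge_flags: list[bool], human_flags: list[bool]) -> dict[str, float]:
    """Compare Judge flags with human labels on the same interactions."""
    if not judge_flags or len(judge_flags) != len(human_flags):
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(judge_flags, human_flags))
    total = len(pairs)
    return {
        # Share of cases where Judge and human made the same call.
        "agreement": sum(j == h for j, h in pairs) / total,
        # Judge flagged, human did not: unnecessary review.
        "false_positive_share": sum(j and not h for j, h in pairs) / total,
        # Human flagged, Judge did not: missed violation.
        "false_negative_share": sum(h and not j for j, h in pairs) / total,
    }
```

Note that raw agreement can look high when violations are rare, so the false negative share is worth tracking alongside the 90% target.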

3. Human Oversight (HITL)

Humans review findings, make decisions, remain accountable.

[Figure: HITL Architecture]

Triggers

Trigger	Response
Judge flag	Review interaction
Guardrail block	Review if legitimate
User escalation	Human takes over
Sampling	Quality assurance
Threshold breach	Investigate pattern

Queue Design

Queue	SLA	Reviewer
Critical	1h	Senior + expert
High	4h	Domain expert
Standard	24h	Trained reviewer
Sampling	72h	QA team
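The SLA column translates directly into a due time per item. A small sketch using the hours from the table; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hours taken from the queue table above.
SLA_HOURS = {"Critical": 1, "High": 4, "Standard": 24, "Sampling": 72}


def due_at(queue: str, created_at: datetime) -> datetime:
    """Deadline for reviewing an item placed on the given queue."""
    return created_at + timedelta(hours=SLA_HOURS[queue])


def is_breached(queue: str, created_at: datetime, now: datetime) -> bool:
    """True once the item has waited longer than its queue's SLA."""
    return now > due_at(queue, created_at)


created = datetime(2026, 1, 1, 9, 0, tzinfo=timezone.utc)
print(due_at("High", created).isoformat())  # 2026-01-01T13:00:00+00:00
```

Using timezone-aware timestamps avoids comparing naive and aware datetimes when reviewers sit in different regions.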

Actions

Action	When
Approve	Interaction appropriate
Correct	Minor issue, fixable
Escalate	Needs senior review
Block user	Abuse detected
Tune	False positive

Prevent Rubber-Stamping

Control	Purpose
Canary cases	Verify reviewers catch known-bad
Time tracking	Flag too-fast reviews
Volume limits	Prevent fatigue
Inter-rater checks	Measure consistency
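Three of these controls reduce to simple measurements over review records. The sketch below assumes a hypothetical `Review` record and a 15-second minimum review time, and uses Cohen's kappa as one common choice of inter-rater statistic; none of these specifics come from the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Review:
    reviewer: str
    seconds_spent: float
    approved: bool
    is_canary: bool = False  # Known-bad case seeded into the queue.


def canary_catch_rate(reviews: list[Review]) -> Optional[float]:
    """Share of seeded known-bad cases that reviewers did not approve."""
    canaries = [r for r in reviews if r.is_canary]
    if not canaries:
        return None
    return sum(not r.approved for r in canaries) / len(canaries)


def too_fast(reviews: list[Review], min_seconds: float = 15.0) -> list[Review]:
    """Reviews completed faster than a plausible reading time (threshold is assumed)."""
    return [r for r in reviews if r.seconds_spent < min_seconds]


def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    """Agreement between two reviewers on the same items, corrected for chance."""
    if not a or len(a) != len(b):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    a_yes, b_yes = sum(a) / n, sum(b) / n
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    if expected == 1.0:
        return 1.0  # Both reviewers gave one identical label throughout.
    return (observed - expected) / (1 - expected)
```

A falling canary catch rate or a growing `too_fast` list is a prompt to look at workload and training, not proof of negligence on its own.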

Going Deeper

Topic	Document
What these controls cost in production	Cost & Latency — latency budgets, sampling strategies, tiered evaluation cascade
Judge accuracy, drift, and adversarial failure	Judge Assurance · When the Judge Can Be Fooled
Practical guardrail configurations	Practical Guardrails — what to turn on first, encoding detection, international PII
Controls for multi-agent systems	MASO Framework — 93 controls across 6 domains for agent orchestration
Controls for reasoning models (o1, etc.)	Reasoning Model Controls — trace scanning, instruction adherence, consistency checks

Implementation Order

1. Logging: you can't evaluate what you don't capture
2. Basic guardrails: block obvious attacks
3. Judge in shadow: evaluate without acting
4. HITL queues: give findings somewhere to go
5. Judge advisory: surface findings to humans
6. Enhanced guardrails: add ML detection
7. Judge operational: let findings drive workflows
8. Continuous tuning: improve from findings

AI Runtime Behaviour Security, 2026 (Jonathan Gill).