# AI Runtime Behaviour Security — Core
Implementing behavioral controls for AI systems in production.
This is the implementation companion to the Foundation overview. The Foundation explains the architecture and principles; this section contains the risk classification criteria, control definitions, checklists, and specialised controls you need to put them into practice.
## Reading Order
Start with the essentials, then branch into specialised topics based on your deployment:
Essential (read in order):

1. Risk Tiers — classify your system
2. Risk Assessment — quantify control effectiveness and residual risk per tier
3. Controls — implement the three-layer pattern
4. Agentic — add controls if your agent has tool access
5. IAM Governance — identity, lifecycle, delegation
6. Judge Assurance — measure and calibrate the Judge
7. Checklist — track implementation progress
Specialised (read based on your deployment type):
| If you're deploying... | Read |
|---|---|
| Multimodal models (image, audio, video) | Multimodal Controls |
| Reasoning models (chain-of-thought) | Reasoning Model Controls |
| Streaming responses | Streaming Controls |
| Persistent memory or long context | Memory and Context |
| Multi-agent systems | Multi-Agent Controls then MASO |
| Open-weight / self-hosted models | Open-Weight Models |
PACE resilience (read after controls):

- Control Layer Resilience — PACE for each control layer
- PACE for Agentic AI — PACE for agentic deployments
- PACE Checklist — verify your fail postures
## The Fundamental Shift
Traditional software can be tested before deployment. AI cannot — not fully.
| Traditional Software | AI Systems |
|---|---|
| Deterministic outputs | Non-deterministic |
| Testable at design time | Emergent behavior |
| Known failure modes | Adversarial discovery |
The shift: From design-time assurance to runtime behavioral monitoring.
## The Pattern
The industry is converging on three layers:
| Layer | Function | Timing |
|---|---|---|
| Guardrails | Block known-bad inputs/outputs | Real-time |
| Judge | Detect unknown-bad via LLM evaluation | Async |
| Human Oversight | Decide, act, remain accountable | As needed |
Guardrails prevent. Judge detects. Humans decide.
## Where This Pattern Exists
This isn't theoretical. Production implementations include:
| Platform | Implementation |
|---|---|
| NVIDIA NeMo Guardrails | Input, dialog, retrieval, execution, output rails |
| LangChain | Middleware + human-in-the-loop |
| Guardrails AI | Open-source validator framework |
| Galileo | Eval-to-guardrail lifecycle |
| DeepEval | LLM-as-judge evaluation |
| AWS Bedrock Guardrails | Managed filtering |
| Azure AI Content Safety | Content moderation |
→ For a detailed solution comparison, see Current Solutions.
What's been missing: clear guidance on why this pattern is necessary and how to implement it proportionate to risk.
## Scope
In: Custom LLM apps, AI decision support, document processing, agentic systems
Out: Vendor AI products, model training, data preparation
## Quick Start

### 1. Classify Your System
| Tier | Profile | Examples |
|---|---|---|
| CRITICAL | Direct decisions, customer/financial/safety impact | Credit decisions, fraud blocking |
| HIGH | Significant influence, sensitive data | Customer service with account access |
| MEDIUM | Moderate impact, human review expected | Internal Q&A, document drafting |
| LOW | Minimal impact, non-sensitive | Public FAQ, suggestions |
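The tier table above can be read as a decision rule. The function below is an illustrative reduction of it, not the authoritative criteria (those live in the Risk Tiers document); the parameter names are assumptions made for this sketch.

```python
def classify_tier(direct_decisions: bool, customer_impact: bool,
                  sensitive_data: bool, human_review_expected: bool) -> str:
    """Illustrative tier classification mirroring the table above.

    Evaluated top-down: the highest-risk profile that matches wins.
    """
    if direct_decisions and customer_impact:
        return "CRITICAL"   # e.g. credit decisions, fraud blocking
    if sensitive_data:
        return "HIGH"       # e.g. customer service with account access
    if human_review_expected:
        return "MEDIUM"     # e.g. internal Q&A, document drafting
    return "LOW"            # e.g. public FAQ, suggestions
```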
### 2. Select Controls
| Control | LOW | MEDIUM | HIGH | CRITICAL |
|---|---|---|---|---|
| Input guardrails | Basic | Standard | Enhanced | Maximum |
| Output guardrails | Basic | Standard | Enhanced | Maximum |
| Judge evaluation | — | Sampling | All | All + real-time |
| Human review | Exceptions | Sampling | Risk-based | All significant |
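Encoding the control matrix as data (rather than scattering tier checks through code) keeps the tier-to-control mapping auditable in one place. A possible sketch, with the dictionary keys chosen for this example:

```python
# Tier-to-control mapping, transcribed from the table above.
# None means the control is not required at that tier.
CONTROL_MATRIX = {
    "LOW":      {"input": "Basic",    "output": "Basic",    "judge": None,              "human": "Exceptions"},
    "MEDIUM":   {"input": "Standard", "output": "Standard", "judge": "Sampling",        "human": "Sampling"},
    "HIGH":     {"input": "Enhanced", "output": "Enhanced", "judge": "All",             "human": "Risk-based"},
    "CRITICAL": {"input": "Maximum",  "output": "Maximum",  "judge": "All + real-time", "human": "All significant"},
}

def controls_for(tier: str) -> dict:
    """Look up the required control levels for a classified tier."""
    return CONTROL_MATRIX[tier]
```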
### 3. Implement in Order

1. Logging — you can't evaluate what you don't capture
2. Basic guardrails — block obvious attacks
3. Judge in shadow mode — evaluate without acting
4. HITL queues — give findings somewhere to go
5. Operationalise — act on findings, tune continuously
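Step 3, shadow mode, is the one teams most often skip. The idea: the Judge scores every exchange, but its verdict is only logged, never enforced, until its calibration is trusted. A minimal sketch, where `judge_fn`, the 0.8 threshold, and the return strings are all placeholders:

```python
import logging

def judge_in_shadow(prompt: str, response: str, judge_fn, act: bool = False) -> str:
    """Run the Judge on an exchange; enforce its verdict only when act=True.

    judge_fn is a stand-in for your LLM-as-judge call returning a risk
    score in [0, 1]. With act=False (shadow mode) scores are logged so
    you can measure false-positive/negative rates before enforcement.
    """
    score = judge_fn(prompt, response)
    logging.info("judge score=%.2f act=%s", score, act)
    if act and score > 0.8:
        return "queue_for_review"  # enforced: route to the HITL queue
    return "logged_only"           # shadow: evaluate without action
```

Flipping `act` from False to True is then a one-line promotion from shadow mode to enforcement, once measured Judge accuracy justifies it.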
## Core Documents
| Document | Purpose |
|---|---|
| Risk Tiers | Classification criteria, control mapping |
| Risk Assessment | Quantitative control effectiveness, residual risk analysis, worked examples per tier |
| Controls | Guardrails, Judge, HITL implementation |
| Agentic | Additional controls for agents |
| IAM Governance | Identity governance, agent lifecycle, delegation, threats |
| Checklist | Implementation tracking |
| Emerging Controls | Multimodal, reasoning, streaming overview |
## Specialised Controls
| Document | Purpose |
|---|---|
| Judge Assurance | Judge accuracy measurement and calibration |
| Multi-Agent Controls | Controls for multi-agent systems |
| Multimodal Controls | Controls for image, audio, and video AI |
| Memory and Context | Long context and persistent memory controls |
| Reasoning Model Controls | Controls for chain-of-thought reasoning models |
| Streaming Controls | Controls for real-time streaming outputs |
## Analysis & Insights
| Document | Purpose |
|---|---|
| Oversight Readiness Problem | Why human-in-the-loop fails and how to fix it |
| When the Judge Can Be Fooled | Judge adversarial robustness |
| Open Weight Models Shift the Burden | Self-hosted model control implications |
| Future Considerations | Future framework scope |
## PACE Sections
| Document | Purpose |
|---|---|
| PACE Controls Section | PACE framework — controls |
| PACE Agentic Section | PACE framework — agentic controls |
| PACE Checklist Section | PACE framework — implementation checklist |
## Architecture Overview
## Extensions
| Folder | Contents |
|---|---|
| regulatory/ | ISO 42001, EU AI Act mapping |
| technical/ | Bypass prevention, infrastructure, metrics |
| templates/ | Playbooks, threat models |
| examples/ | Worked examples |
## Key Principles
- Match controls to risk — Don't over-engineer LOW tier systems
- Guardrails are necessary but not sufficient — They miss novel attacks and nuance
- Judge is assurance, not control — It detects; humans decide what to do
- Infrastructure beats instructions — Enforce technically, not via prompts
- Assume bypasses happen — Design for detection, not just prevention
- Humans remain accountable — AI assists; humans own outcomes
AI Runtime Behaviour Security, 2026 (Jonathan Gill).