AI Runtime Behaviour Security — Single-Agent Controls¶
Runtime behavioural security for single-model AI deployments. Guardrails, LLM-as-Judge, and human oversight — scaled to risk.
Part of the AI Runtime Behaviour Security framework · Version 1.0 · February 2026 · Jonathan Gill
How This Section Is Organised¶
This page is the conceptual overview — it explains the architecture, the risk-scaling model, and how the pieces connect. The implementation details — risk classification criteria, specific control definitions, checklists, and specialised controls for multimodal, reasoning, streaming, and memory — live in the Core directory.
| If you want to... | Go here |
|---|---|
| Understand the architecture and principles | You're in the right place — keep reading |
| Classify a system and select controls | Core: Risk Tiers → Controls |
| See the implementation checklist | Core: Checklist |
| Read specialised controls (multimodal, reasoning, streaming, memory) | Core: Specialised Controls |
Architecture¶
Three layers, one principle: you can't fully test a non-deterministic system before deployment, so you continuously verify behaviour in production.
Layer 1 — Guardrails block known-bad inputs and outputs at machine speed (~10ms). Deterministic pattern matching: content filters, PII detection, topic restrictions, rate limits. Every request passes through. No exceptions.
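A minimal sketch of a Layer 1 guardrail, assuming regex-based PII and topic filtering. The specific patterns and the `GuardrailResult` shape are illustrative, not part of the framework; production deployments would load patterns from a managed policy source.

```python
import re
from dataclasses import dataclass

# Illustrative deny patterns only -- real deployments manage these as policy.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),   # US SSN-like identifier
    re.compile(r"\b\d{16}\b"),              # bare 16-digit card number
]
BLOCKED_TOPICS = [re.compile(r"\b(?:weapon|exploit)\b", re.IGNORECASE)]

@dataclass
class GuardrailResult:
    allowed: bool
    reason: str = ""

def check_input(text: str) -> GuardrailResult:
    """Deterministic pattern matching: every request passes through here."""
    for pattern in PII_PATTERNS:
        if pattern.search(text):
            return GuardrailResult(False, "pii_detected")
    for pattern in BLOCKED_TOPICS:
        if pattern.search(text):
            return GuardrailResult(False, "blocked_topic")
    return GuardrailResult(True)
```

Because the checks are pure pattern matches, they run at machine speed and apply identically to inputs and outputs.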
Layer 2 — LLM-as-Judge catches unknown-bad through independent model evaluation (~500ms–5s). A separate LLM evaluates the task agent's outputs against policy, factual grounding, tone, and safety criteria. Catches what guardrails can't pattern-match.
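A sketch of the Judge layer, assuming the judge is any callable that takes a prompt and returns a verdict string. The `judge_model` parameter, prompt wording, and criteria list are illustrative; the point is that the evaluator is a separate model from the task agent.

```python
from typing import Callable

# Criteria from the framework's Judge layer description.
JUDGE_CRITERIA = ["policy", "factual grounding", "tone", "safety"]

JUDGE_PROMPT = (
    "You are an independent evaluator. Assess the following output "
    "against these criteria: {criteria}.\n"
    "Output:\n{output}\n"
    "Respond with PASS or FAIL and a short reason."
)

def judge_output(output: str, judge_model: Callable[[str], str]) -> bool:
    """Independent second-model evaluation; returns True if the output passes.
    judge_model is any prompt->verdict callable (a separate LLM in production)."""
    prompt = JUDGE_PROMPT.format(criteria=", ".join(JUDGE_CRITERIA), output=output)
    verdict = judge_model(prompt)
    return verdict.strip().upper().startswith("PASS")
```

In production the callable would wrap a real model API; keeping it a parameter makes the layer testable with a stub.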
Layer 3 — Human Oversight provides the accountability backstop. Scope scales with risk: low-risk systems get spot checks, high-risk systems get human approval before commit. Humans decide edge cases. Humans own outcomes.
Circuit Breaker stops all AI traffic and activates a non-AI fallback when any layer fails. Not a degradation — a full stop with a predetermined safe state.
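The circuit-breaker behaviour can be sketched as follows: once any layer reports failure, every request is served by the predetermined non-AI fallback until an operator resets the breaker. The class and function names are illustrative.

```python
from typing import Callable

class CircuitBreaker:
    """Full stop, not degradation: once tripped, no request reaches the
    AI pipeline until the breaker is explicitly reset."""

    def __init__(self, fallback: Callable[[str], str]):
        self.tripped = False
        self.fallback = fallback  # predetermined non-AI safe state

    def report_layer_failure(self, layer: str) -> None:
        # Any layer failure trips the breaker; there is no partial service.
        self.tripped = True

    def reset(self) -> None:
        # Deliberate operator action, after incident response completes.
        self.tripped = False

    def handle(self, request: str, ai_pipeline: Callable[[str], str]) -> str:
        if self.tripped:
            return self.fallback(request)
        return ai_pipeline(request)
```

The design choice worth noting: tripping is automatic, resetting is manual. A breaker that resets itself reintroduces the failed layer silently.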
This pattern already exists in production at major platforms: NVIDIA NeMo Guardrails, AWS Bedrock, Azure AI, LangChain, Guardrails AI, and others. This framework provides a vendor-neutral implementation: risk classification, controls, fail postures, and tested fallback paths.
Get Started¶
| If you want to... | Go here |
|---|---|
| Get the whole framework on one page | Cheat Sheet / Decision Poster |
| Deploy low-risk AI fast | Fast Lane |
| Understand the concepts in 30 minutes | Quick Start |
| Implement controls with working code | Implementation Guide |
| Classify a system by risk | Risk Tiers |
| Deploy an agentic AI system | Agentic Controls |
| Understand what happens when controls fail | PACE Resilience |
| Enforce controls at the infrastructure layer | Infrastructure Controls |
| Track your implementation | Checklist |
| Secure a multi-agent system | MASO Framework |
Before You Build Controls¶
The First Control: Choosing the Right Tool
The most effective way to reduce AI risk is to not use AI where it doesn't belong. Before guardrails, judges, or human oversight — ask whether AI is the right tool for this problem.
If your deployment is internal, read-only, handles no regulated data, and has a human reviewing output — start with the Fast Lane. You may not need the rest.
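The Fast Lane eligibility test above can be written as a single predicate. The attribute names are illustrative; the authoritative criteria are in the Fast Lane document.

```python
def fast_lane_eligible(internal: bool, read_only: bool,
                       regulated_data: bool, human_reviews_output: bool) -> bool:
    """All four conditions must hold. Failing any one of them means the
    deployment needs a full risk-tier classification instead."""
    return internal and read_only and not regulated_data and human_reviews_output
```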
Risk-Scaled Controls¶
Controls scale to risk so low-risk AI moves fast and high-risk AI stays safe.
| Risk Tier | Controls Required | PACE Posture | Use Case Examples |
|---|---|---|---|
| Low | Fast Lane: minimal guardrails, self-certification | P only (fail-open with logging) | Internal chatbots, document summarisation, code assistance |
| Medium | Guardrails + Judge, periodic human review | P + A configured | Customer-facing content, recommendation engines, search |
| High | All three layers, human-in-the-loop for writes | P + A + C configured and tested | Financial advice, medical support, regulatory decisions |
| Critical | Full architecture, mandatory human approval | Full PACE cycle with tested E→P recovery | Autonomous actions on regulated data, safety-critical systems |
Classify your system: Risk Tiers
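The tier-to-controls mapping in the table above can be sketched as a lookup driven by a classification function. The boolean attributes and thresholds here are illustrative; the real classification criteria live in Risk Tiers.

```python
# Controls and PACE postures, transcribed from the risk-tier table.
TIER_CONTROLS = {
    "low":      {"controls": "fast_lane",         "pace": "P"},
    "medium":   {"controls": "guardrails+judge",  "pace": "P+A"},
    "high":     {"controls": "all_three_layers",  "pace": "P+A+C"},
    "critical": {"controls": "full_architecture", "pace": "full_PACE"},
}

def classify(regulated_data: bool, external_users: bool, can_write: bool) -> str:
    """Illustrative classification: tier reflects deployment context,
    not model capability."""
    if regulated_data and can_write:
        return "critical"
    if regulated_data or (external_users and can_write):
        return "high"
    if external_users:
        return "medium"
    return "low"
```

The structure matters more than the thresholds: classification is a pure function of deployment attributes, so two teams deploying the same model in different contexts land in different tiers.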
PACE Resilience¶
Every control has a defined failure mode. The PACE methodology ensures that when a control layer degrades — and it will — the system fails safely rather than silently.
Primary: All layers operational. Normal production.
Alternate: One layer degraded. Backup activated. Scope tightened. Example: Judge layer is down → guardrails remain active, all outputs queued for human review.
Contingency: Multiple layers degraded. AI operates in supervised-only mode. Human approves every action. Reduced capacity, high assurance.
Emergency: Confirmed compromise or cascading failure. Circuit breaker fires. AI traffic stopped. Non-AI fallback activated. Incident response engaged.
Even at the lowest risk tier, there's a fallback plan. At the highest, there's a structured degradation path from full autonomy to full stop.
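The degradation path above can be sketched as a state function driven by the number of degraded layers, with a confirmed compromise jumping straight to Emergency. The thresholds are illustrative.

```python
def pace_state(degraded_layers: int, compromised: bool) -> str:
    """PACE degradation: a structured path from full autonomy to full stop."""
    if compromised:
        return "Emergency"     # circuit breaker fires, non-AI fallback
    if degraded_layers == 0:
        return "Primary"       # all layers operational
    if degraded_layers == 1:
        return "Alternate"     # backup activated, scope tightened
    return "Contingency"       # supervised-only, human approves every action
```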
Core Documents¶
| Document | Purpose |
|---|---|
| Cheat Sheet | Entire framework on one page — classify, control, fail posture, test |
| Decision Poster | Visual one-page reference |
| Fast Lane | Pre-approved minimal controls for low-risk AI |
| Risk Tiers | Classify your system, determine control and resilience requirements |
| Risk Assessment | Quantitative control effectiveness, residual risk per tier, NIST AI RMF aligned |
| Controls | Guardrails, Judge, and Human Oversight implementation with per-layer fail postures |
| Agentic | Controls for single autonomous AI agents including graceful degradation path |
| PACE Resilience | What happens when controls fail |
| Checklist | Track implementation and PACE verification progress |
| Emerging Controls | Multimodal, reasoning, and streaming considerations (theoretical) |
Infrastructure Controls¶
This framework defines what to enforce. The infrastructure section defines how — 80 technical controls across 11 domains, with standards mappings and platform-specific patterns.
Domains: Identity & Access Management (8), Logging & Observability (10), Network & Segmentation (8), Data Protection (8), Secrets & Credentials (8), Supply Chain (8), Incident Response (8), Tool Access (6), Session & Scope (5), Delegation Chains (5), Sandbox Patterns (6).
Standards mappings: Every control maps to the three-layer model, ISO 42001 Annex A, NIST AI RMF, and OWASP LLM/Agentic Top 10.
Platform patterns: AWS Bedrock, Azure AI, and Databricks reference architectures.
When You Need Multi-Agent¶
When AI agents collaborate, delegate tasks, and take autonomous actions across trust boundaries, the single-agent controls on this page are necessary but not sufficient. The MASO Framework extends this architecture into multi-agent orchestration.
| What MASO adds | Why single-agent controls aren't enough |
|---|---|
| Inter-agent message bus security | Agents communicating directly create uncontrolled trust boundaries |
| Non-Human Identity per agent | Shared credentials between agents create lateral movement risk |
| Epistemic integrity controls | Hallucinations compound across agent chains; confidence inflates without evidence |
| Transitive authority prevention | Delegation creates implicit privilege escalation |
| Kill switch architecture | Multi-agent cascading failures require system-wide emergency stop |
| Dual OWASP coverage | Agentic Top 10 (2026) risks only exist when agents act autonomously |
| Document | Purpose |
|---|---|
| MASO Overview | Architecture, PACE integration, OWASP dual mapping, 6 control domains |
| Tier 1 — Supervised | Low autonomy: human approves all writes |
| Tier 2 — Managed | Medium autonomy: NHI, signed bus, LLM-as-Judge, continuous monitoring |
| Tier 3 — Autonomous | High autonomy: self-healing PACE, adversarial testing, isolated kill switch |
| Red Team Playbook | 13 adversarial test scenarios for multi-agent systems |
| Integration Guide | LangGraph, AutoGen, CrewAI, AWS Bedrock implementation patterns |
| Worked Examples | Financial services, healthcare, critical infrastructure |
Extensions¶
| Folder | Contents |
|---|---|
| Regulatory | ISO 42001 and EU AI Act mapping |
| Technical | Bypass prevention, metrics |
| Industry Solutions | Guardrails, evaluators, and safety model reference |
| Templates | Risk assessment templates, implementation plans |
| Worked Examples | Per-tier implementation walkthroughs |
Insights¶
Foundational Arguments
| Article | Key Argument |
|---|---|
| The First Control: Choosing the Right Tool | Design thinking before technology selection |
| Why Your AI Guardrails Aren't Enough | Guardrails block known-bad; you need detection for unknown-bad |
| The Judge Detects. It Doesn't Decide. | Async evaluation beats real-time blocking for nuanced decisions |
| Infrastructure Beats Instructions | You can't secure AI systems with prompts alone |
| Risk Tier Is Use Case, Not Technology | Classification reflects deployment context, not model capability |
| Humans Remain Accountable | AI assists decisions; humans own outcomes |
Emerging Challenges
| Article | Key Argument |
|---|---|
| The Verification Gap | Current safety approaches can't confirm ground truth |
| Behavioral Anomaly Detection | Aggregating signals to detect drift from expected behaviour |
| Multimodal AI Breaks Your Text-Based Guardrails | Images, audio, and video create new attack surfaces |
| When AI Thinks Before It Answers | Reasoning models need reasoning-aware controls |
| When Agents Talk to Agents | Multi-agent accountability gaps → see MASO |
| The Memory Problem | Long context and persistent memory introduce novel risks |
| You Can't Validate What Hasn't Finished | Real-time streaming challenges existing validation |
| Open-Weight Models Shift the Burden | Self-hosted models inherit the provider's control responsibilities |
| When the Judge Can Be Fooled | The Judge layer needs its own threat model |
Platforms Implementing This Pattern¶
This isn't a theoretical proposal. These platforms already implement variants of the three-layer pattern:
| Platform | Approach |
|---|---|
| NVIDIA NeMo Guardrails | Five rail types: input, dialog, retrieval, execution, output |
| LangChain | Middleware chains with human-in-the-loop |
| Guardrails AI | Open-source validator framework |
| Galileo | Eval-to-guardrail lifecycle |
| DeepEval | LLM-as-judge evaluation framework |
| AWS Bedrock Guardrails | Managed input/output filtering |
| Azure AI Content Safety | Content filtering and moderation |
Standards Alignment¶
| Standard | Relevance | Mapping |
|---|---|---|
| OWASP LLM Top 10 | Security vulnerabilities in LLM applications | OWASP mapping |
| OWASP Agentic Top 10 | Risks specific to autonomous AI agents | MASO mapping |
| NIST AI RMF | AI risk management framework | NIST mapping |
| ISO 42001 | AI management system standard | ISO 42001 mapping |
| NIST SP 800-218A | Secure development for generative AI | SP 800-218A mapping |
| MITRE ATLAS | Adversarial threat landscape for AI | MASO threat intelligence |
| DORA | Digital operational resilience | MASO regulatory alignment |
Scope¶
In scope: Custom LLM applications, AI decision support, document processing, single-agent systems — from deployment through incident response.
Out of scope: Vendor AI products (use vendor controls), model training (see MLOps security guidance), and pre-deployment testing. This framework is about what happens in production.
Pre-deployment complement: For secure development practices covering data sourcing, training, fine-tuning, and model release, see NIST SP 800-218A. This framework begins where SP 800-218A ends.
For multi-agent systems: See MASO.
Contributing¶
Feedback, corrections, and extensions welcome. See CONTRIBUTING.md.
AI Runtime Behaviour Security, 2026 (Jonathan Gill).