AI Runtime Behaviour Security — Single-Agent Controls

Runtime behavioural security for single-model AI deployments. Guardrails, LLM-as-Judge, and human oversight — scaled to the risk.

Part of the AI Runtime Behaviour Security framework · Version 1.0 · February 2026 · Jonathan Gill

License: MIT


How This Section Is Organised

This page is the conceptual overview — it explains the architecture, the risk-scaling model, and how the pieces connect. The implementation details — risk classification criteria, specific control definitions, checklists, and specialised controls for multimodal, reasoning, streaming, and memory — live in the Core directory.

If you want to... Go here
Understand the architecture and principles You're in the right place — keep reading
Classify a system and select controls Core: Risk Tiers and Controls
See the implementation checklist Core: Checklist
Read specialised controls (multimodal, reasoning, streaming, memory) Core: Specialised Controls

Architecture

Single-Agent Security Architecture

Three layers, one principle: you can't fully test a non-deterministic system before deployment, so you continuously verify behaviour in production.

Layer 1 — Guardrails block known-bad inputs and outputs at machine speed (~10ms). Deterministic pattern matching: content filters, PII detection, topic restrictions, rate limits. Every request passes through. No exceptions.
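As a rough illustration of what a deterministic guardrail does, the sketch below checks input text against deny-list patterns before any model call. The specific patterns and topic list are hypothetical placeholders; a real deployment would use a maintained PII detector and policy-specific filters.

```python
import re

# Hypothetical deny-list; real deployments use maintained PII detectors
# and policy-specific topic filters, not hand-rolled regexes.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN format
    re.compile(r"\b\d{16}\b"),             # bare 16-digit card number
]
BLOCKED_TOPICS = {"weapons", "self-harm"}

def guardrail_check(text: str) -> tuple[bool, str]:
    """Return (allowed, reason). Pure pattern matching: deterministic,
    no model calls, so it runs at machine speed on every request."""
    for pattern in PII_PATTERNS:
        if pattern.search(text):
            return False, "pii_detected"
    lowered = text.lower()
    for topic in BLOCKED_TOPICS:
        if topic in lowered:
            return False, f"blocked_topic:{topic}"
    return True, "ok"
```

The same shape applies on the output side: the task agent's response passes through an equivalent check before it reaches the user.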

Layer 2 — LLM-as-Judge catches unknown-bad through independent model evaluation (~500ms–5s). A separate LLM evaluates task agent outputs against policy, factual grounding, tone, and safety criteria. Catches what guardrails can't pattern-match.
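A minimal sketch of the judge layer, with an assumed prompt template and verdict format (PASS/FAIL on the first line). The model client is injected as a plain callable so the judge model stays independent of the task model; any real implementation would use its platform's evaluation API instead.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative prompt; real criteria come from the deployment's policy.
JUDGE_PROMPT = """You are an independent evaluator. Assess the ASSISTANT
output against policy, factual grounding, tone, and safety.
Reply with exactly PASS or FAIL on the first line, then a reason.

USER: {user_input}
ASSISTANT: {output}"""

@dataclass
class Verdict:
    passed: bool
    reason: str

def judge(user_input: str, output: str,
          call_model: Callable[[str], str]) -> Verdict:
    """call_model is any LLM client taking a prompt and returning text.
    Injecting it keeps the judge independent of the task agent."""
    reply = call_model(JUDGE_PROMPT.format(user_input=user_input, output=output))
    first, _, rest = reply.partition("\n")
    return Verdict(passed=first.strip().upper() == "PASS",
                   reason=rest.strip())
```

Because the judge adds ~500ms–5s of latency, many deployments run it asynchronously and act on failed verdicts after the fact rather than blocking every response.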

Layer 3 — Human Oversight provides the accountability backstop. Scope scales with risk: low-risk systems get spot checks, high-risk systems get human approval before commit. Humans decide edge cases. Humans own outcomes.

Circuit Breaker stops all AI traffic and activates a non-AI fallback when any layer fails. Not a degradation — a full stop with a predetermined safe state.
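The circuit-breaker behaviour can be sketched as follows: consecutive layer failures trip the breaker, and once open, every request routes to the predetermined non-AI fallback. The threshold and trip logic here are illustrative assumptions, not a prescription.

```python
class CircuitBreaker:
    """Trips after `threshold` consecutive failures. Once open it stays
    open: a full stop to a safe state, not gradual degradation."""
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0
        self.open = False

    def record(self, layer_ok: bool) -> None:
        self.failures = 0 if layer_ok else self.failures + 1
        if self.failures >= self.threshold:
            self.open = True

def handle(request, breaker, ai_pipeline, fallback):
    """Route through the AI pipeline while healthy; divert every
    request to the non-AI fallback once the breaker is open."""
    if breaker.open:
        return fallback(request)
    try:
        result = ai_pipeline(request)
        breaker.record(True)
        return result
    except Exception:
        breaker.record(False)
        if breaker.open:
            return fallback(request)
        raise
```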

This pattern already exists in production at major platforms: NVIDIA NeMo, AWS Bedrock, Azure AI, LangChain, Guardrails AI, and others. This framework provides the vendor-neutral implementation: risk classification, controls, fail postures, and tested fallback paths.


Get Started

If you want to... Go here
Get the whole framework on one page Cheat Sheet / Decision Poster
Deploy low-risk AI fast Fast Lane
Understand the concepts in 30 minutes Quick Start
Implement controls with working code Implementation Guide
Classify a system by risk Risk Tiers
Deploy an agentic AI system Agentic Controls
Understand what happens when controls fail PACE Resilience
Enforce controls at the infrastructure layer Infrastructure Controls
Track your implementation Checklist
Secure a multi-agent system MASO Framework

Before You Build Controls

The First Control: Choosing the Right Tool

The most effective way to reduce AI risk is to not use AI where it doesn't belong. Before guardrails, judges, or human oversight — ask whether AI is the right tool for this problem.

If your deployment is internal, read-only, handles no regulated data, and has a human reviewing output — start with the Fast Lane. You may not need the rest.


Risk-Scaled Controls

Controls scale to risk so low-risk AI moves fast and high-risk AI stays safe.

Risk Tier Controls Required PACE Posture Use Case Examples
Low Fast Lane: minimal guardrails, self-certification P only (fail-open with logging) Internal chatbots, document summarisation, code assistance
Medium Guardrails + Judge, periodic human review P + A configured Customer-facing content, recommendation engines, search
High All three layers, human-in-the-loop for writes P + A + C configured and tested Financial advice, medical support, regulatory decisions
Critical Full architecture, mandatory human approval Full PACE cycle with tested E→P recovery Autonomous actions on regulated data, safety-critical systems
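
The tier logic above can be sketched as a simple decision function. The three input attributes are a simplification for illustration; the authoritative classification criteria live in the Risk Tiers document.

```python
def classify(regulated_data: bool, autonomous_writes: bool,
             customer_facing: bool) -> str:
    """Illustrative tier heuristic: deployment context, not model
    capability, drives the classification."""
    if regulated_data and autonomous_writes:
        return "Critical"   # autonomous actions on regulated data
    if regulated_data:
        return "High"       # regulated data, human-in-the-loop for writes
    if customer_facing or autonomous_writes:
        return "Medium"     # customer-facing or write-capable
    return "Low"            # internal, read-only: Fast Lane eligible
```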

Classify your system: Risk Tiers


PACE Resilience

Every control has a defined failure mode. The PACE methodology ensures that when a control layer degrades — and it will — the system fails safely rather than silently.

Primary: All layers operational. Normal production.

Alternate: One layer degraded. Backup activated. Scope tightened. Example: Judge layer is down → guardrails remain active, all outputs queued for human review.

Contingency: Multiple layers degraded. AI operates in supervised-only mode. Human approves every action. Reduced capacity, high assurance.

Emergency: Confirmed compromise or cascading failure. Circuit breaker fires. AI traffic stopped. Non-AI fallback activated. Incident response engaged.
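The four postures map to layer health roughly as sketched below. The mapping is illustrative: real PACE transitions are defined per tier in the PACE Resilience document, and would account for which layer degraded, not just how many.

```python
from enum import Enum

class Pace(Enum):
    PRIMARY = "P"      # all layers operational
    ALTERNATE = "A"    # one layer degraded, backup active
    CONTINGENCY = "C"  # supervised-only mode
    EMERGENCY = "E"    # circuit breaker fired, non-AI fallback

def pace_state(guardrails_ok: bool, judge_ok: bool, oversight_ok: bool,
               compromise: bool) -> Pace:
    """Illustrative mapping from layer health to PACE posture."""
    if compromise:
        return Pace.EMERGENCY
    degraded = [guardrails_ok, judge_ok, oversight_ok].count(False)
    if degraded == 0:
        return Pace.PRIMARY
    if degraded == 1:
        return Pace.ALTERNATE
    return Pace.CONTINGENCY
```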

Even at the lowest risk tier, there's a fallback plan. At the highest, there's a structured degradation path from full autonomy to full stop.


Core Documents

Document Purpose
Cheat Sheet Entire framework on one page — classify, control, fail posture, test
Decision Poster Visual one-page reference
Fast Lane Pre-approved minimal controls for low-risk AI
Risk Tiers Classify your system, determine control and resilience requirements
Risk Assessment Quantitative control effectiveness, residual risk per tier, NIST AI RMF aligned
Controls Guardrails, Judge, and Human Oversight implementation with per-layer fail postures
Agentic Controls for single autonomous AI agents including graceful degradation path
PACE Resilience What happens when controls fail
Checklist Track implementation and PACE verification progress
Emerging Controls Multimodal, reasoning, and streaming considerations (theoretical)

Infrastructure Controls

This framework defines what to enforce. The infrastructure section defines how — 80 technical controls across 11 domains, with standards mappings and platform-specific patterns.

Domains: Identity & Access Management (8), Logging & Observability (10), Network & Segmentation (8), Data Protection (8), Secrets & Credentials (8), Supply Chain (8), Incident Response (8), Tool Access (6), Session & Scope (5), Delegation Chains (5), Sandbox Patterns (6).

Standards mappings: Every control maps to the three-layer model, ISO 42001 Annex A, NIST AI RMF, and OWASP LLM/Agentic Top 10.

Platform patterns: AWS Bedrock, Azure AI, and Databricks reference architectures.


When You Need Multi-Agent

When AI agents collaborate, delegate tasks, and take autonomous actions across trust boundaries, the single-agent controls on this page are necessary but not sufficient. The MASO Framework extends this architecture into multi-agent orchestration.

What MASO adds Why single-agent controls aren't enough
Inter-agent message bus security Agents communicating directly create uncontrolled trust boundaries
Non-Human Identity per agent Shared credentials between agents create lateral movement risk
Epistemic integrity controls Hallucinations compound across agent chains; confidence inflates without evidence
Transitive authority prevention Delegation creates implicit privilege escalation
Kill switch architecture Multi-agent cascading failures require system-wide emergency stop
Dual OWASP coverage Agentic Top 10 (2026) risks only exist when agents act autonomously

Document Purpose
MASO Overview Architecture, PACE integration, OWASP dual mapping, 6 control domains
Tier 1 — Supervised Low autonomy: human approves all writes
Tier 2 — Managed Medium autonomy: NHI, signed bus, LLM-as-Judge, continuous monitoring
Tier 3 — Autonomous High autonomy: self-healing PACE, adversarial testing, isolated kill switch
Red Team Playbook 13 adversarial test scenarios for multi-agent systems
Integration Guide LangGraph, AutoGen, CrewAI, AWS Bedrock implementation patterns
Worked Examples Financial services, healthcare, critical infrastructure

Extensions

Folder Contents
Regulatory ISO 42001 and EU AI Act mapping
Technical Bypass prevention, metrics
Industry Solutions Guardrails, evaluators, and safety model reference
Templates Risk assessment templates, implementation plans
Worked Examples Per-tier implementation walkthroughs

Insights

Foundational Arguments

Article Key Argument
The First Control: Choosing the Right Tool Design thinking before technology selection
Why Your AI Guardrails Aren't Enough Guardrails block known-bad; you need detection for unknown-bad
The Judge Detects. It Doesn't Decide. Async evaluation beats real-time blocking for nuanced decisions
Infrastructure Beats Instructions You can't secure AI systems with prompts alone
Risk Tier Is Use Case, Not Technology Classification reflects deployment context, not model capability
Humans Remain Accountable AI assists decisions; humans own outcomes

Emerging Challenges

Article Key Argument
The Verification Gap Current safety approaches can't confirm ground truth
Behavioral Anomaly Detection Aggregating signals to detect drift from expected behaviour
Multimodal AI Breaks Your Text-Based Guardrails Images, audio, and video create new attack surfaces
When AI Thinks Before It Answers Reasoning models need reasoning-aware controls
When Agents Talk to Agents Multi-agent accountability gaps → see MASO
The Memory Problem Long context and persistent memory introduce novel risks
You Can't Validate What Hasn't Finished Real-time streaming challenges existing validation
Open-Weight Models Shift the Burden Self-hosted models inherit the provider's control responsibilities
When the Judge Can Be Fooled The Judge layer needs its own threat model

Platforms Implementing This Pattern

This isn't a theoretical proposal. These platforms already implement variants of the three-layer pattern:

Platform Approach
NVIDIA NeMo Guardrails Five rail types: input, dialog, retrieval, execution, output
LangChain Middleware chains with human-in-the-loop
Guardrails AI Open-source validator framework
Galileo Eval-to-guardrail lifecycle
DeepEval LLM-as-judge evaluation framework
AWS Bedrock Guardrails Managed input/output filtering
Azure AI Content Safety Content filtering and moderation

Standards Alignment

Standard Relevance Mapping
OWASP LLM Top 10 Security vulnerabilities in LLM applications OWASP mapping
OWASP Agentic Top 10 Risks specific to autonomous AI agents MASO mapping
NIST AI RMF AI risk management framework NIST mapping
ISO 42001 AI management system standard ISO 42001 mapping
NIST SP 800-218A Secure development for generative AI SP 800-218A mapping
MITRE ATLAS Adversarial threat landscape for AI MASO threat intelligence
DORA Digital operational resilience MASO regulatory alignment

Scope

In scope: Custom LLM applications, AI decision support, document processing, single-agent systems — from deployment through incident response.

Out of scope: Vendor AI products (use vendor controls), model training (see MLOps security guidance), and pre-deployment testing. This framework is about what happens in production.

Pre-deployment complement: For secure development practices covering data sourcing, training, fine-tuning, and model release, see NIST SP 800-218A. This framework begins where SP 800-218A ends.

For multi-agent systems: See MASO.


Contributing

Feedback, corrections, and extensions welcome. See CONTRIBUTING.md.

AI Runtime Behaviour Security, 2026 (Jonathan Gill).