Why AI Security Is a Runtime Problem¶
You cannot fully test a non-deterministic system before deployment. Security must shift from "prove it works" to "continuously verify it's working."
The Assumption That Doesn't Hold¶
Enterprise security is built on a fundamental assumption: test the system, prove it works, deploy it, protect it. Static analysis, penetration testing, code review, QA — all happen before production. If the system passes, it ships.
AI breaks this model.
A traditional API returns the same output for the same input. You can test it exhaustively. An AI system returns different outputs for the same input. The same prompt, same model, same temperature — different response. Not occasionally. By design. Non-determinism is the feature, not the bug.
This means your pre-deployment test suite — no matter how thorough — proves the system can behave correctly. It cannot prove the system will behave correctly on the next request.
What This Means in Practice¶
Consider a customer service AI. You test it with 10,000 prompts. It passes every one. You deploy it. On request 10,001, a customer phrases their complaint in a way your test suite never anticipated. The model generates a response that promises a refund your company doesn't offer, includes internal pricing data from its context window, or follows an instruction embedded in the customer's message.
None of these failures are bugs. The model is functioning as designed — generating plausible responses to novel inputs. The problem is that "plausible" and "correct" are not the same thing, and you cannot enumerate every possible input to a system that accepts natural language.
This is not a theoretical risk. Documented incidents include:
- Auto-GPT executing a cryptocurrency transfer after following instructions embedded in a retrieved email
- GitHub Copilot generating remote code execution vulnerabilities from prompt injection in code comments
- Cursor IDE executing arbitrary commands via poisoned configuration files
- A banking AI approving fraudulent transactions that pattern-matched to legitimate requests
- Perplexity AI exfiltrating data through carefully crafted retrieval queries
Each of these systems passed testing. Each failed in production.
The Architecture That Works¶
If pre-deployment testing cannot fully secure an AI system, security must be continuous. Every request, every response, every interaction — monitored in production.
The pattern that major platforms have converged on independently uses three layers:
Layer 1: Guardrails¶
Real-time pattern matching that blocks known-bad inputs and outputs. PII detection, content policy enforcement, prompt injection signatures. Sub-10ms latency. Runs synchronously — the response does not reach the user until the guardrail passes.
Guardrails catch the threats you can define in advance. They are necessary but insufficient. You cannot write a regex for every possible failure mode of a system that generates natural language.
Layer 2: LLM-as-Judge¶
An independent AI model evaluating the primary model's output. Asynchronous — it runs in parallel or after delivery, taking 500ms to 5 seconds. The Judge assesses whether the response is appropriate, safe, within scope, and consistent with the system's purpose.
The Judge catches what guardrails miss: novel failure modes, subtle policy violations, context-dependent risks that require semantic understanding. It detects unknown-bad — the category of failures you could not enumerate in advance.
Critically, the Judge must be independent. Different model, different provider if possible, no shared context with the primary model. If the primary model is compromised, the Judge must not be compromised with it.
Layer 3: Human Oversight¶
Human review for edge cases the automated layers cannot resolve. The scope scales with risk: for low-risk systems, humans review flagged exceptions. For high-risk systems — regulated decisions, autonomous actions, financial transactions — humans approve before execution.
Human oversight is not a bottleneck. It is a control. The three-layer pattern ensures that only genuinely ambiguous cases reach human reviewers, because the guardrails and Judge have already filtered the clear passes and clear failures.
The Circuit Breaker¶
When the layers themselves fail — when the guardrail service goes down, or the Judge starts returning errors, or human reviewers are unavailable — a circuit breaker stops all AI traffic and activates a safe fallback. A deterministic system that returns known-safe responses rather than allowing unmonitored AI to continue serving.
The system fails safely. Not silently.
Why Guardrails Alone Are Not Enough¶
Most enterprise "AI security" today consists of guardrails: input/output filters that block known-bad patterns. This is Layer 1 without Layers 2 and 3.
Guardrails are fast and effective against known threats. But AI failure modes are not limited to known threats. The most dangerous outputs are the ones that look perfectly normal — a response that is fluent, confident, and wrong. A recommendation that is plausible but based on hallucinated data. An action that is technically authorised but semantically inappropriate.
No guardrail catches "this response is factually correct but contextually harmful." That requires semantic evaluation — a Judge.
No guardrail catches "this response is technically within policy but the user's intent is adversarial." That requires human judgement.
The three layers work because they address fundamentally different failure categories: known-bad (guardrails), unknown-bad (Judge), and ambiguous (human). Remove any layer and you have a gap.
When Agents Talk to Agents¶
Single-model AI is hard enough to secure. Multi-agent systems — where multiple AI models collaborate, delegate, and negotiate — compound the problem in ways that single-agent controls do not address.
The Compounding Problem¶
In a multi-agent system, Agent A receives a user request and delegates part of it to Agent B, which queries Agent C for information. Agent C returns data containing an embedded instruction. Agent B acts on the instruction. Agent A incorporates the result without knowing what happened. The user receives a compromised output.
This is prompt injection at scale — not user-to-agent, but agent-to-agent. And it is not the only failure mode:
- Hallucination amplification. Agent A hallucinates a fact. Agent B treats it as verified input. Agent C cites it as established. Three agents agree on something that was never true.
- Transitive authority. Agent A has permission to read files. Agent A delegates to Agent B, which inherits Agent A's permissions. Agent B delegates to Agent C. Now Agent C has file access that was never explicitly granted.
- Epistemic failures. Agents independently reach the same wrong conclusion — not because they are coordinating, but because they share the same training data biases. The system produces confident consensus on a hallucination.
These are not adversarial attacks. They are emergent properties of systems where AI interacts with AI. No attacker is required. The architecture itself is the vulnerability.
What Multi-Agent Security Requires¶
Each agent needs its own identity, its own permissions, its own guardrails, and its own Judge evaluation. Trust between agents must be explicit and scoped — not inherited from the orchestrator and not assumed because "we are all in the same system."
Every agent-to-agent message is untrusted input. Every delegation is a potential privilege escalation. Every consensus is a potential correlated failure.
The controls exist. The challenge is that most organisations are deploying multi-agent systems with single-agent security models, or with no security model at all.
What You Should Do Now¶
If you are deploying AI in an enterprise, you need runtime behavioural security. Here is where to start:
1. Classify your AI systems by use case, not by technology. An internal chatbot summarising documentation is a different risk profile from an autonomous agent executing financial transactions. The controls must be proportionate to the risk, not uniform across all deployments.
2. Deploy guardrails immediately. Input and output filtering is the minimum viable control. PII detection, content policy enforcement, injection pattern matching. This is Layer 1 and it should be non-negotiable for any AI system touching production.
3. Add a Judge. An independent model evaluating your primary model's output. Start with the highest-risk systems. The Judge does not need to be perfect — it needs to catch what the guardrails miss.
4. Define your circuit breaker. What happens when your AI system fails? Not "the model returns an error" — what is the safe fallback? A deterministic response? A human takeover? A service degradation? Decide before you need it.
5. Log everything. Every prompt, every response, every guardrail decision, every Judge evaluation. You cannot secure what you cannot observe. And you cannot investigate an incident retroactively if you did not capture the data.
6. Treat multi-agent deployments as a distinct risk category. If your agents talk to other agents, your single-agent controls are necessary but not sufficient. Agent-to-agent messages are untrusted input. Permissions do not transfer implicitly. Every boundary is an attack surface.
The Bottom Line¶
AI security is not a design-time problem with a design-time solution. It is a runtime problem that requires continuous, layered, independent monitoring of every interaction in production.
The pattern is proven. NVIDIA NeMo Guardrails, AWS Bedrock Guardrails, Azure AI Content Safety, LangChain's evaluation framework — major platforms have converged on the same architecture independently. Guardrails for speed, Judge for depth, humans for nuance, circuit breakers for failure.
The question is not whether your AI systems need runtime behavioural security. The question is whether you will implement it before or after an incident forces you to.
AI Runtime Behaviour Security, 2026 (Jonathan Gill).