# 1. What Goes Wrong
After this module you will be able to:
- Identify three distinct failure modes in multi-agent architectures
- Explain why each failure mode is invisible to perimeter-style security controls
- Map the Phantom Compliance scenario to a generalised threat model for agent chains
## Beyond Phantom Compliance
The scenario showed you one failure: incomplete retrieval leading to a confident but wrong compliance check. As a security architect, you need to see the class of failures this represents.
Multi-agent systems introduce three failure categories that don't exist in single-agent or traditional software systems:
### Failure Mode 1: Reasoning-basis corruption
An agent produces correct output given what it knew, but what it knew was incomplete, stale, or subtly wrong.
The Phantom Compliance case is an example: Agent B's output was logically valid given the partial data it had. The failure wasn't in the reasoning; it was in the reasoning inputs.
Other instances:
- An agent retrieves a cached version of a policy document that was updated 20 minutes ago
- A summarisation agent receives a truncated input and summarises what's there without flagging the truncation
- A tool-calling agent receives a partial API response due to a timeout and processes it as complete
Why it's hard to detect: The agent's output is internally consistent. Guardrails pass. Output quality checks pass. The only signal is in metadata about the retrieval or context construction step, and most monitoring stacks don't inspect that step.
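That metadata check can be made concrete. The sketch below is a minimal, hypothetical example (the `RetrievalRecord` fields and `reasoning_basis_flags` helper are illustrative, not from any real framework): it inspects the inputs to an agent step, flagging staleness, truncation, and incomplete responses before the agent reasons over them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical retrieval metadata; field names are illustrative assumptions.
@dataclass
class RetrievalRecord:
    doc_id: str
    fetched_at: datetime          # when this copy was retrieved/cached
    source_updated_at: datetime   # last-modified time reported by the source
    truncated: bool               # did the retriever cut the document off?
    complete: bool                # did the tool call return a full response?

def reasoning_basis_flags(records: list[RetrievalRecord],
                          max_staleness: timedelta = timedelta(minutes=5)) -> list[str]:
    """Inspect the *inputs* to an agent step, not its output.

    Returns human-readable flags; an empty list means the reasoning basis
    looks sound. Output guardrails would pass either way -- this is the
    context-construction check most monitoring stacks skip.
    """
    flags = []
    for r in records:
        if r.source_updated_at > r.fetched_at:
            flags.append(f"{r.doc_id}: cached copy is stale (source updated after fetch)")
        if r.truncated:
            flags.append(f"{r.doc_id}: input was truncated before the agent saw it")
        if not r.complete:
            flags.append(f"{r.doc_id}: partial response (e.g. timeout) would be processed as complete")
    return flags
```

The design point is that these flags come from metadata about retrieval, not from the agent's output, so they catch failures that are invisible to output-quality checks.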
### Failure Mode 2: Chain-of-trust propagation
A downstream agent treats an upstream agent's output as authoritative without independent verification. An error in one link propagates through the entire chain, often gaining confidence at each step.
In Phantom Compliance, Agent C had no mechanism to question Agent B's compliance assessment. It took the "CLEAR" status and incorporated it into its decision as a verified fact.
Why it's hard to detect: Each agent in the chain is operating correctly on the data it has. The failure is in the inter-agent trust model, which is usually implicit (and usually wrong). The chain looks fine from any single agent's perspective.
### Failure Mode 3: Emergent behaviour in delegation
When agents can delegate tasks to other agents, behaviours emerge that weren't designed and weren't tested. An agent might:
- Delegate a subtask to another agent that uses a different (less capable or less constrained) model
- Create a sub-chain that bypasses controls applied to the main chain
- Delegate iteratively until the original intent is diluted or lost
Why it's hard to detect: Delegation is often a feature, not a bug. The monitoring challenge is distinguishing intended delegation from unsafe delegation, and knowing when a delegation chain has drifted too far from the original task's constraints.
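A minimal control here is to make the task's constraints travel with every delegation hop. The sketch below is an illustrative pattern (names like `DelegationContext` are assumptions, not a real API): each hop inherits the original model allow-list and a depth budget, so a sub-chain cannot pick a less constrained model or delegate indefinitely.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DelegationContext:
    """Constraints that travel with the task; field names are hypothetical."""
    task_id: str
    allowed_models: frozenset[str]  # models vetted for this task's constraints
    max_depth: int                  # budget for chained delegation
    depth: int = 0

    def delegate(self, target_model: str) -> "DelegationContext":
        """Every hop inherits the original constraints unchanged."""
        if target_model not in self.allowed_models:
            raise PermissionError(f"{target_model} not permitted for {self.task_id}")
        if self.depth + 1 > self.max_depth:
            raise RuntimeError(f"delegation depth limit reached for {self.task_id}")
        return DelegationContext(self.task_id, self.allowed_models,
                                 self.max_depth, self.depth + 1)
```

This does not distinguish intended from unsafe delegation by itself, but it bounds the drift: a sub-chain that tries to swap models or extend itself past the budget fails loudly instead of silently diluting the original intent.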
## The threat model
For a security architect, these three failure modes map to a threat model:
| Failure mode | Threat | Attack surface | Impact |
|---|---|---|---|
| Reasoning-basis corruption | Agent acts on incomplete/wrong data | RAG retrieval, tool calls, context construction | Confident wrong output |
| Chain-of-trust propagation | Upstream errors amplified downstream | Inter-agent interfaces | Wrong decisions with full audit trail |
| Emergent delegation | Sub-chains bypass controls | Agent-to-agent delegation | Uncontrolled execution paths |
Notice what's not in this threat model: prompt injection, jailbreaking, model extraction. Those are real threats, and you still need controls for them. But they're well-understood and widely covered. The three failures above are the ones that most security architectures miss, because they require you to think about chains, not agents.
**Architectural insight:** The unit of security analysis for AI runtime security is the chain, not the agent. Controls that only inspect individual agents will miss every failure mode that originates in the interactions between agents.
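To make "the chain is the unit of analysis" concrete, a chain-level control can record every inter-agent hop and audit the sequence as a whole. The sketch below is a deliberately simple, hypothetical example (the `Hop` record and `audit_chain` helper are assumptions): it flags hops that leave the reviewed agent set, something no per-agent check can see.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hop:
    src: str      # producing agent
    dst: str      # consuming agent
    payload: str  # what crossed the inter-agent interface

def audit_chain(hops: list[Hop], reviewed_agents: set[str]) -> list[str]:
    """Analyse the chain as a whole, not each agent in isolation.

    The simplest chain-level finding: a hop whose endpoint is outside
    the set of agents the chain's controls were designed around.
    """
    findings = []
    for h in hops:
        if h.src not in reviewed_agents or h.dst not in reviewed_agents:
            findings.append(f"hop {h.src} -> {h.dst} leaves the reviewed chain")
    return findings
```

A real implementation would audit far more (claim provenance, delegation depth, retrieval metadata per hop), but the shape is the same: the input to the analysis is the sequence of interactions, not any single agent's output.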
**Reflection**
Think about an AI system in your organisation (or one you're designing). How many of its security controls are applied to individual agents vs. applied to the chain as a whole? Where would a Phantom Compliance-style failure hide?