# 1. What Goes Wrong
After this module you will be able to:
- Identify three distinct failure modes in multi-agent architectures
- Explain why each failure mode is invisible to perimeter-style security controls
- Map the Phantom Compliance scenario to a generalised threat model for agent chains
## Beyond Phantom Compliance
The scenario showed you one failure: incomplete retrieval leading to a confident but wrong compliance check. As a security architect, you need to see the class of failures this represents.
Multi-agent systems introduce three failure categories that don't exist in single-agent or traditional software systems:
### Failure Mode 1: Reasoning-basis corruption
An agent produces correct output given what it knew, but what it knew was incomplete, stale, or subtly wrong.
The Phantom Compliance case is an example: Agent B's output was logically valid given the partial data it had. The failure wasn't in the reasoning; it was in the reasoning inputs.
Other instances:
- An agent retrieves a cached version of a policy document that was updated 20 minutes ago
- A summarisation agent receives a truncated input and summarises what's there without flagging the truncation
- A tool-calling agent receives a partial API response due to a timeout and processes it as complete
Why it's hard to detect: The agent's output is internally consistent. Guardrails pass. Output quality checks pass. The only signal is in metadata about the retrieval or context construction step, and most monitoring stacks don't inspect that step.
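That metadata check can be made concrete. The sketch below is a minimal, hypothetical example (the `RetrievalRecord` fields and `reasoning_basis_flags` helper are illustrative, not from any real framework): it inspects the inputs to an agent step, flagging staleness, truncation, and incomplete responses before the agent reasons over them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical retrieval metadata; field names are illustrative assumptions.
@dataclass
class RetrievalRecord:
    doc_id: str
    fetched_at: datetime          # when this copy was retrieved/cached
    source_updated_at: datetime   # last-modified time reported by the source
    truncated: bool               # did the retriever cut the document off?
    complete: bool                # did the tool call return a full response?

def reasoning_basis_flags(records: list[RetrievalRecord],
                          max_staleness: timedelta = timedelta(minutes=5)) -> list[str]:
    """Inspect the *inputs* to an agent step, not its output.

    Returns human-readable flags; an empty list means the reasoning basis
    looks sound. Output guardrails would pass either way -- this is the
    context-construction check most monitoring stacks skip.
    """
    flags = []
    for r in records:
        if r.source_updated_at > r.fetched_at:
            flags.append(f"{r.doc_id}: cached copy is stale (source updated after fetch)")
        if r.truncated:
            flags.append(f"{r.doc_id}: input was truncated before the agent saw it")
        if not r.complete:
            flags.append(f"{r.doc_id}: partial response (e.g. timeout) would be processed as complete")
    return flags
```

The design point is that these flags come from metadata about retrieval, not from the agent's output, so they catch failures that are invisible to output-quality checks.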
### Failure Mode 2: Chain-of-trust propagation
A downstream agent treats an upstream agent's output as authoritative without independent verification. An error in one link propagates through the entire chain, often gaining confidence at each step.
In Phantom Compliance, Agent C had no mechanism to question Agent B's compliance assessment. It took the "CLEAR" status and incorporated it into its decision as a verified fact.
Why it's hard to detect: Each agent in the chain is operating correctly on the data it has. The failure is in the inter-agent trust model, which is usually implicit (and usually wrong). The chain looks fine from any single agent's perspective.
### Failure Mode 3: Emergent behaviour in delegation
When agents can delegate tasks to other agents, behaviours emerge that weren't designed and weren't tested. An agent might:
- Delegate a subtask to another agent that uses a different (less capable or less constrained) model
- Create a sub-chain that bypasses controls applied to the main chain
- Delegate iteratively until the original intent is diluted or lost
Why it's hard to detect: Delegation is often a feature, not a bug. The monitoring challenge is distinguishing intended delegation from unsafe delegation, and knowing when a delegation chain has drifted too far from the original task's constraints.
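A minimal control here is to make the task's constraints travel with every delegation hop. The sketch below is an illustrative pattern (names like `DelegationContext` are assumptions, not a real API): each hop inherits the original model allow-list and a depth budget, so a sub-chain cannot pick a less constrained model or delegate indefinitely.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DelegationContext:
    """Constraints that travel with the task; field names are hypothetical."""
    task_id: str
    allowed_models: frozenset[str]  # models vetted for this task's constraints
    max_depth: int                  # budget for chained delegation
    depth: int = 0

    def delegate(self, target_model: str) -> "DelegationContext":
        """Every hop inherits the original constraints unchanged."""
        if target_model not in self.allowed_models:
            raise PermissionError(f"{target_model} not permitted for {self.task_id}")
        if self.depth + 1 > self.max_depth:
            raise RuntimeError(f"delegation depth limit reached for {self.task_id}")
        return DelegationContext(self.task_id, self.allowed_models,
                                 self.max_depth, self.depth + 1)
```

This does not distinguish intended from unsafe delegation by itself, but it bounds the drift: a sub-chain that tries to swap models or extend itself past the budget fails loudly instead of silently diluting the original intent.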
## The threat model
For a security architect, these three failure modes map to a threat model:
| Failure mode | Threat | Attack surface | Impact |
|---|---|---|---|
| Reasoning-basis corruption | Agent acts on incomplete/wrong data | RAG retrieval, tool calls, context construction | Confident wrong output |
| Chain-of-trust propagation | Upstream errors amplified downstream | Inter-agent interfaces | Wrong decisions with full audit trail |
| Emergent delegation | Sub-chains bypass controls | Agent-to-agent delegation | Uncontrolled execution paths |
Notice what's not in this threat model: prompt injection, jailbreaking, model extraction. Those are real threats, and you still need controls for them. But they're well-understood and widely covered. The three failures above are the ones that most security architectures miss, because they require you to think about chains, not agents.
**Architectural insight:** The unit of security analysis for AI runtime security is the chain, not the agent. Controls that only inspect individual agents will miss every failure mode that originates in the interactions between agents.
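To make "the chain is the unit of analysis" concrete, a chain-level control can record every inter-agent hop and audit the sequence as a whole. The sketch below is a deliberately simple, hypothetical example (the `Hop` record and `audit_chain` helper are assumptions): it flags hops that leave the reviewed agent set, something no per-agent check can see.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hop:
    src: str      # producing agent
    dst: str      # consuming agent
    payload: str  # what crossed the inter-agent interface

def audit_chain(hops: list[Hop], reviewed_agents: set[str]) -> list[str]:
    """Analyse the chain as a whole, not each agent in isolation.

    The simplest chain-level finding: a hop whose endpoint is outside
    the set of agents the chain's controls were designed around.
    """
    findings = []
    for h in hops:
        if h.src not in reviewed_agents or h.dst not in reviewed_agents:
            findings.append(f"hop {h.src} -> {h.dst} leaves the reviewed chain")
    return findings
```

A real implementation would audit far more (claim provenance, delegation depth, retrieval metadata per hop), but the shape is the same: the input to the analysis is the sequence of interactions, not any single agent's output.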
**Reflection**
Think about an AI system in your organisation (or one you're designing). How many of its security controls are applied to individual agents vs. applied to the chain as a whole? Where would a Phantom Compliance-style failure hide?