The Gap¶
You've seen the failure. Now let's understand why the standard security toolkit doesn't catch it.
What most organisations deploy today¶
A typical AI security posture in 2026 looks like this:
| Control | What it does | What it catches |
|---|---|---|
| Input guardrails | Block prompt injection, jailbreaks, toxic inputs | Adversarial inputs |
| Output guardrails | Block harmful, off-topic, or policy-violating outputs | Bad outputs |
| Content filtering | Flag PII, credentials, restricted content in outputs | Data leakage |
| Logging & audit | Record all inputs, outputs, and metadata | Forensic review (after the fact) |
| Rate limiting | Prevent abuse and cost overruns | Resource exhaustion |
| Model evaluation | Benchmark accuracy, bias, and capability pre-deployment | Known weaknesses before production |
These controls are necessary. They catch real attacks and real failures. But they share a common assumption:
The hidden assumption
Every control above evaluates the agent's inputs or outputs in isolation. None of them verify that the reasoning process between input and output was based on complete, current, and relevant information.
Applying each control to Phantom Compliance¶
Input guardrails: Agent B received a well-formed input from Agent A. No injection, no jailbreak, no toxicity. Pass.
Output guardrails: Agent B produced a well-formatted compliance assessment. No blocked patterns, no hallucination flags on the text. Pass.
Content filtering: No PII, no credentials, no restricted content in the output. Pass.
Logging: The full transcript was recorded. The logs show a normal pipeline run. The partial retrieval is technically visible in the raw token data, but no alert triggers because no one defined "retrieval completeness" as a monitoring dimension. Pass (the failure is logged but invisible).
Rate limiting: Normal request volume. Pass.
Model evaluation: The model was evaluated before deployment. It performed well on compliance checking benchmarks. But benchmarks test the model with complete inputs. They don't test what happens when retrieval returns partial data under context pressure. Pass (the eval didn't cover this case).
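To make the eval gap concrete, here is a minimal sketch of the untested case. Everything in it is illustrative (the `run_compliance_check` stand-in, the `SEC-###` identifiers, the list size); the point is that the same model logic passes the benchmark with complete inputs and fails silently with truncated ones.

```python
# Hypothetical eval case: the same compliance question, but retrieval is
# truncated to simulate context pressure. All names here are illustrative
# stand-ins, not a real agent API.

SECURITIES = [f"SEC-{i:03d}" for i in range(50)]  # full restricted list

def run_compliance_check(order: str, restricted: list[str]) -> str:
    """Stand-in for the agent: flags the order if it hits the list."""
    return "violation" if order in restricted else "clear"

order = "SEC-047"

# Standard benchmark condition: complete inputs -> correct answer.
assert run_compliance_check(order, SECURITIES) == "violation"

# The untested condition: retrieval surfaced only the first 20 entries.
# The answer is wrong, but nothing about it looks wrong.
partial = SECURITIES[:20]
assert run_compliance_check(order, partial) == "clear"
```

An eval suite that systematically perturbs retrieval (truncation, staleness, reordering) would have surfaced this case before production; standard accuracy benchmarks never exercise it.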
The three gaps¶
Gap 1: No verification of reasoning inputs¶
Current controls verify what goes into the agent (the user or upstream prompt) and what comes out (the response). But they don't verify the intermediate data the agent retrieves or generates during its reasoning process.
In the Phantom Compliance case, Agent B's retrieval of a partial securities list was an intermediate step. No control examined whether that retrieval was complete.
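One way to close this gap is to check intermediate retrievals against an independently known total before the agent reasons over them. A minimal sketch, assuming the upstream source of truth can report an expected record count (the `verify_retrieval` helper and `RetrievalIncomplete` exception are hypothetical names):

```python
# Sketch of an intermediate-step check: compare what a retrieval actually
# returned against an independently known total, and fail closed rather
# than letting the agent reason over partial data.

class RetrievalIncomplete(Exception):
    """Raised when a retrieval step returns fewer records than expected."""

def verify_retrieval(records: list, expected_count: int) -> list:
    if len(records) < expected_count:
        raise RetrievalIncomplete(
            f"retrieved {len(records)} of {expected_count} records"
        )
    return records

# The source of truth says the restricted list has 50 entries, but the
# retrieval step under context pressure only surfaced 20.
try:
    verify_retrieval(records=list(range(20)), expected_count=50)
except RetrievalIncomplete as exc:
    print(f"blocked: {exc}")
```

The design choice that matters is failing closed: a blocked pipeline run is visible and recoverable, while a plausible answer built on partial data is neither.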
Gap 2: No cross-agent verification¶
Each agent is monitored independently. Agent A's output is checked. Agent B's output is checked. Agent C's output is checked. But nobody asks:
- Is Agent B's output consistent with what a complete check would produce?
- Does Agent C have enough information to verify Agent B's claims?
- Did the chain as a whole maintain integrity, or did it silently degrade?
This is the difference between monitoring agents and monitoring chains.
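A chain-level monitor could be sketched as follows: each hop records how much of its expected evidence it actually consumed, and a separate check verifies the whole chain before the final output is released. The `HopRecord` fields and the coverage threshold are assumptions for illustration, not an existing API.

```python
# Sketch of monitoring the chain rather than the agents. Each hop reports
# its evidence coverage; one check spans all hops.

from dataclasses import dataclass

@dataclass
class HopRecord:
    agent: str
    evidence_expected: int  # records the hop should have consumed
    evidence_used: int      # records it actually consumed

def chain_intact(hops: list[HopRecord], min_coverage: float = 1.0) -> bool:
    """True only if every hop consumed its full evidence set."""
    return all(
        h.evidence_used >= h.evidence_expected * min_coverage for h in hops
    )

chain = [
    HopRecord("agent_a", evidence_expected=1,  evidence_used=1),
    HopRecord("agent_b", evidence_expected=50, evidence_used=20),  # degraded
    HopRecord("agent_c", evidence_expected=1,  evidence_used=1),
]

print(chain_intact(chain))  # False: integrity was lost at agent_b
```

Per-agent monitoring would have passed all three hops individually; only the cross-hop view exposes the silent degradation.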
Gap 3: No distinction between "looks right" and "is right"¶
Guardrails and output quality checks evaluate plausibility: does the output look like a reasonable response? In a multi-agent chain, plausibility is necessary but not sufficient. An output can be plausible, internally consistent, and confidently stated while being based on incomplete data.
The missing capability is epistemic verification: confirming not just that the output looks like a correct compliance check, but that the compliance check was actually performed against complete data.
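The distinction can be made concrete by separating the two checks. A plausibility check inspects the output text; an epistemic check inspects provenance metadata attached to the claim. Both functions and the provenance fields below are illustrative assumptions:

```python
# Sketch of "looks right" vs "is right". The plausibility check sees only
# the output; the epistemic check sees how the claim was produced.

def plausible(output: str) -> bool:
    """Output-level check: well-formed, on-topic, confidently stated."""
    return "compliance" in output.lower() and len(output) > 20

def epistemically_verified(provenance: dict) -> bool:
    """Claim-level check: was the check run against complete data?"""
    return provenance["records_checked"] == provenance["records_total"]

output = "Compliance check complete: no restricted securities found."
provenance = {"records_checked": 20, "records_total": 50}

print(plausible(output))                   # True  -- it looks right
print(epistemically_verified(provenance))  # False -- it is not right
```

Every control in the table at the top of this page lives on the left-hand side of this distinction; the gap is the right-hand side.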
The cost of these gaps¶
In the Phantom Compliance scenario, the cost was a regulatory violation discovered three days later. But the same structural failure (agents acting on incomplete reasoning inputs, with downstream agents trusting upstream outputs without verification) can manifest as:
- A customer-facing agent giving medical information based on a partial retrieval of contraindication data
- A code generation agent that passes security review because its security-checking agent only evaluated a subset of the generated code
- A procurement agent that approves a vendor because its due diligence agent's search returned truncated results
- An autonomous research agent that draws conclusions from incomplete literature retrieval and passes those conclusions to a decision-making agent
The pattern is always the same: the output looks correct, the logs look clean, and the failure is only discovered when reality doesn't match what the system said.
The gap in one sentence: Current AI security controls verify that agents behave correctly but not that agents reason correctly, and in multi-agent systems, the distinction is the entire attack surface.
Where to go from here¶
You now understand the threat and the gap. The next step depends on your role.
Security Architects¶
You design and integrate security controls into systems
Your thread: threat model → MASO control domains → three-layer architecture → implementation patterns.
Risk & Governance¶
You own risk frameworks, compliance, and oversight obligations
Your thread: threat model → why governance doesn't cover agent chains → MASO as extension layer → oversight obligations.
Engineering Leads¶
You build and operate AI systems in production
Your thread: threat model → what breaks in practice → runtime vs design-time controls → instrumentation.