
2. Why Controls Miss It


After this module you will be able to:

  • Explain why the standard three-layer AI security pattern (with circuit breaker) is necessary but insufficient for multi-agent systems
  • Identify the architectural blind spot in perimeter-model security applied to AI
  • Describe the difference between output monitoring and reasoning-chain monitoring

The standard pattern and where it breaks

Most AI security architectures follow a pattern derived from traditional application security:

[Diagram: Single-agent security pattern]

This gives you:

  • Input guardrails: Block bad prompts (injection, jailbreak, policy violations)
  • Output guardrails: Block bad responses (toxicity, data leakage, off-topic)
  • Monitoring: Log everything, alert on anomalies

For a single agent with a single request-response cycle, this pattern works well. It's the AI equivalent of a web application firewall: inspect inbound, inspect outbound, log the middle.
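The single-agent pattern above can be sketched in a few lines. This is an illustrative skeleton, not any particular library's API; the function names (`check_input`, `check_output`, `guarded_call`) and the blocked-pattern lists are assumptions made for the example.

```python
# Single-agent pattern: inspect inbound, inspect outbound, log the middle.
# All names and pattern lists here are illustrative placeholders.

BLOCKED_PATTERNS = ["ignore previous instructions", "system prompt"]

def check_input(prompt: str) -> bool:
    """Input guardrail: reject known injection/jailbreak phrases."""
    return not any(p in prompt.lower() for p in BLOCKED_PATTERNS)

def check_output(response: str) -> bool:
    """Output guardrail: reject responses that leak blocked content."""
    return "api_key" not in response.lower()

def guarded_call(agent, prompt: str) -> str:
    """Wrap one model call with input/output guardrails and an audit log."""
    if not check_input(prompt):
        return "[blocked: input policy violation]"
    response = agent(prompt)  # the model call itself
    print(f"audit: prompt={prompt!r} response={response!r}")  # monitoring
    if not check_output(response):
        return "[blocked: output policy violation]"
    return response
```

Note that everything here operates on content: string patterns in, string patterns out. Nothing asks whether the response is correct.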

The multi-agent problem

Now apply the same pattern to a three-agent chain:

[Diagram: Multi-agent security pattern with per-agent guardrails and monitoring]

Every agent has guardrails. Every agent is monitored. You have six guardrail checkpoints and three monitoring points. This looks thorough.

But what are those guardrails checking at the inter-agent boundaries?

  • Agent A → Agent B: Is A's output well-formed? Does it contain blocked content?
  • Agent B → Agent C: Is B's output well-formed? Does it contain blocked content?

What they're not checking:

  • Is A's output correct? Is it complete?
  • Did B actually perform the task it claims to have performed?
  • Is B's confidence warranted by its actual reasoning process?
  • Does the chain as a whole still serve the original intent?

The guardrails at inter-agent boundaries are doing the same thing as the guardrails at the system boundary: content inspection. They're not doing semantic verification of the reasoning chain.
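The gap between content inspection and semantic verification can be made concrete. In the sketch below (field names and thresholds are hypothetical), a compliance handoff sails through a content guardrail while the reasoning behind it is demonstrably incomplete:

```python
def content_guardrail(message: str) -> bool:
    """What boundary guardrails check: well-formed, no blocked content."""
    return message.isprintable() and "password" not in message.lower()

def semantic_check(handoff: dict) -> bool:
    """What no standard guardrail checks: was the verdict actually earned?"""
    return handoff["policies_checked"] == handoff["policies_total"]

# Hypothetical handoff: Agent B reports CLEAR after checking 3 of 12 policies.
handoff = {"verdict": "CLEAR", "policies_checked": 3, "policies_total": 12}
```

The verdict string passes content inspection; only a check against the handoff's own provenance metadata reveals the broken reasoning chain.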


Three architectural gaps

Gap 1: No reasoning-input verification

The AIRS framework identifies this as a fundamental gap: current architectures verify what agents produce but not what agents consume during reasoning.

In traditional security terms, this is like inspecting HTTP responses without ever checking whether the application queried the right database. You'd catch SQL injection in the input and data leakage in the output, but you'd never know if the application used a stale cache instead of the production database.

For AI systems, the equivalent is:

  • Did the agent's RAG retrieval return complete results?
  • Was the tool call response valid and current?
  • Did the context window contain the data the agent needed?

No standard guardrail checks these.
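What a reasoning-input check could look like, as a minimal sketch: validate the evidence before the agent reasons over it. The field names (`id`, `updated`) and the thresholds are assumptions for illustration.

```python
from datetime import date

def verify_retrieval(results: list, expected_min: int,
                     max_age_days: int, today: date) -> list:
    """Reasoning-input verification: flag incomplete or stale evidence
    BEFORE the agent consumes it. Thresholds are illustrative."""
    problems = []
    if len(results) < expected_min:
        problems.append(f"incomplete: got {len(results)}, expected >= {expected_min}")
    for doc in results:
        age = (today - doc["updated"]).days
        if age > max_age_days:
            problems.append(f"stale: {doc['id']} is {age} days old")
    return problems
```

An empty list means the agent may proceed; any entry means the reasoning inputs are suspect and the chain should pause or escalate.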

Gap 2: No inter-agent trust verification

When Agent C receives Agent B's compliance assessment of "CLEAR", it has no mechanism to ask:

  • How complete was Agent B's check?
  • What data sources did Agent B actually access?
  • What was Agent B's retrieval coverage (did it get the full list or a partial one)?

In traditional security, this is like a microservice accepting a JWT from an upstream service without verifying the claims inside it, trusting that the upstream service authenticated the user correctly because the token exists.

The AI equivalent is trusting that an upstream agent did its job because the output looks like it did its job.
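One way to close this gap is to make upstream agents emit claims alongside verdicts, and have the downstream agent verify those claims, much like validating a JWT's claims rather than trusting its presence. The `Assessment` structure and its fields below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    verdict: str            # e.g. "CLEAR"
    sources_accessed: list  # what the upstream agent actually queried
    coverage: float         # fraction of the required list it retrieved

def accept_upstream(a: Assessment, required_sources: set,
                    min_coverage: float = 1.0) -> bool:
    """Inter-agent trust verification: check the claims behind the
    verdict, not just that a verdict-shaped output arrived."""
    return (set(a.sources_accessed) >= required_sources
            and a.coverage >= min_coverage)
```

A downstream agent that calls `accept_upstream` before consuming the verdict would reject a "CLEAR" built on partial retrieval, exactly the case the plain content guardrail waves through.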

Gap 3: No chain-level integrity monitoring

Each agent is monitored independently. But nobody monitors the chain as a whole:

  • Did the chain's output faithfully represent the chain's inputs?
  • Did information degrade or distort as it passed through agents?
  • Did any agent introduce claims that aren't supported by upstream data?

This is a monitoring architecture problem. Most AI observability tools give you per-agent dashboards. Very few give you chain-level integrity metrics.
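A chain-level integrity metric can be sketched as set arithmetic over the chain's inputs and its final claims. Representing facts and claims as comparable set elements is a simplifying assumption; in practice this mapping is the hard part.

```python
def chain_integrity(source_facts: set, final_claims: set) -> dict:
    """Chain-level metric: which claims in the final output trace back
    to the chain's inputs, and which inputs were silently dropped?"""
    supported = final_claims & source_facts
    return {
        "unsupported": final_claims - source_facts,  # introduced mid-chain
        "dropped": source_facts - final_claims,      # lost in translation
        "fidelity": len(supported) / len(final_claims) if final_claims else 1.0,
    }
```

A fidelity below 1.0 or a non-empty `unsupported` set is exactly the signal per-agent dashboards cannot produce, because it only exists at the level of the whole chain.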


What this means for architecture

The core insight for security architects: you need a second dimension of controls.

The first dimension (which you probably already have) handles content security, blocking bad inputs and outputs. Call this the horizontal dimension: it operates at each agent's boundary.

The second dimension handles reasoning integrity, verifying that each agent's reasoning was based on complete and current information. Call this the vertical dimension: it operates across the chain, checking the flow of information from input to output.
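The two dimensions compose naturally in one orchestration loop: horizontal content checks at every agent boundary, plus one vertical check over the accumulated trace. This is a structural sketch with hypothetical callback parameters (`check_content`, `check_integrity`), not a prescribed implementation.

```python
def run_chain(agents, prompt, check_content, check_integrity):
    """Two control dimensions in one loop: horizontal checks at each
    boundary, one vertical integrity check over the full trace."""
    trace = [prompt]
    message = prompt
    for agent in agents:
        if not check_content(message):      # horizontal: inbound
            raise ValueError(f"inbound content check failed: {message!r}")
        message = agent(message)
        if not check_content(message):      # horizontal: outbound
            raise ValueError(f"outbound content check failed: {message!r}")
        trace.append(message)
    if not check_integrity(trace):          # vertical: whole chain
        raise ValueError("chain-level integrity check failed")
    return message
```

The key design point is that `check_integrity` receives the entire trace, not a single message: it is the only place where "did the output faithfully represent the inputs?" can even be asked.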

[Diagram: Two dimensions of controls, horizontal content security and vertical reasoning integrity]

The AIRS framework formalises this second dimension through the three-layer architecture (Guardrails, Model-as-Judge, Human Oversight) and the MASO control domains. We'll cover both in the next two modules.


Reflection

Look at your current AI security architecture. Where are the controls placed? Are they all on the horizontal dimension (content security at agent boundaries)? Where would you need vertical controls to catch Phantom Compliance-style failures?


Next: Epistemic Integrity →