1. What Breaks in Practice

After this module you will be able to:

  • Identify five production failure modes in multi-agent systems that don't trigger conventional alerts
  • Explain why each failure mode produces clean logs and passing health checks
  • Map the Phantom Compliance scenario to a broader class of silent multi-agent failures
  • Prioritise which failure modes to instrument first based on blast radius

The 2am page that never fires

As an engineering lead, you have spent years building intuition about what breaks in production. You know the symptoms: error rates spike, latency goes red, a dependency throws 503s. Your dashboards catch it. Your on-call gets paged. You triage, mitigate, resolve.

Multi-agent AI systems break differently. The failure modes in this module share a common trait: they don't page you. No errors. No latency spikes. No dependency failures. The system continues to operate, produce outputs, and pass health checks. The damage accumulates silently until a human notices the consequences, sometimes days or weeks later.

The Phantom Compliance scenario is one instance of this pattern. This module covers the full landscape.


Failure Mode 1: Truncated retrieval

Production example

A compliance-checking agent retrieves documents from a vector store to verify a trade against restricted securities. The retrieval query returns results, but the context window can only hold 47 of 312 matching documents. The agent processes what fits, finds no violations in the 47 documents it saw, and returns CLEAR.

The restricted security was in document 204.

What the logs show: A successful retrieval call, a successful LLM inference, a well-formed response. Every span in the trace completes normally. Latency is within bounds.

What the logs don't show: That the agent only saw 15% of the relevant data.

This is the Phantom Compliance failure. It generalises far beyond compliance checking:

  • Customer support chains: A research agent retrieves product documentation but context limits truncate the results. The response agent gives confidently wrong guidance based on partial docs.
  • Code review agents: A security scanning agent retrieves vulnerability databases but hits a result limit. It reports "no known vulnerabilities" for a dependency that has three active CVEs, all on page 2 of the results.
  • Data analysis pipelines: A summarisation agent receives a dataset description that was truncated during inter-agent message passing. It analyses the partial dataset as if it were complete.

The pattern: An agent receives partial data, processes it correctly, and produces output that is internally consistent but factually incomplete. No component errors. No validation failures. The output looks exactly like a correct output.
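One way to make this failure visible is to measure retrieval coverage explicitly: compare how many documents matched the query with how many the agent actually processed, and refuse to return a confident verdict below a threshold. A minimal sketch (the function and exception names, and the 95% threshold, are illustrative assumptions, not part of any framework):

```python
class IncompleteRetrievalError(Exception):
    """Raised when an agent would reason on a truncated match set."""

def coverage_check(matched: int, processed: int,
                   min_coverage: float = 0.95) -> float:
    """Return retrieval coverage; raise instead of letting the agent
    silently treat a partial match set as complete."""
    coverage = processed / matched if matched else 1.0
    if coverage < min_coverage:
        raise IncompleteRetrievalError(
            f"agent saw {processed} of {matched} matching documents "
            f"({coverage:.0%}); refusing to return a confident verdict")
    return coverage
```

In the Phantom Compliance example, `coverage_check(312, 47)` raises rather than letting the agent return CLEAR on 15% of the evidence.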


Failure Mode 2: Context overflow and silent priority

When an agent's context window fills up, something gets dropped. What gets dropped depends on the framework, the prompt structure, and sometimes the model's internal attention patterns. This is rarely deterministic and almost never logged.

Production example

A multi-step planning agent receives:

  1. The user's request (200 tokens)
  2. System instructions including safety constraints (800 tokens)
  3. Retrieved context from previous interactions (2,400 tokens)
  4. Tool call results from three parallel API calls (4,100 tokens)
  5. A chain-of-thought scratchpad from a prior reasoning step (1,800 tokens)

Total: 9,300 tokens. The inter-agent message budget is 8,000 tokens. The framework silently truncates from the middle, dropping most of item 3 (prior interaction context) and part of item 4 (one of the three API results).

The agent plans a workflow that ignores a constraint the user stated three interactions ago and misses data from the dropped API call. The plan looks reasonable in isolation.

What the logs show: Successful context assembly. Successful inference. All API calls completed.

What the logs don't show: That 1,300 tokens of context were silently dropped, including a user constraint and an API result.

The engineering challenge here is that context window management is handled differently by every framework:

  • LangChain truncates from the beginning of the conversation history by default
  • AutoGen may drop earlier messages in multi-turn group chats
  • Custom implementations often have ad-hoc truncation logic buried in utility functions

None of these frameworks emit a metric when truncation occurs. You have to build that instrumentation yourself.
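A sketch of what that instrumentation can look like: assemble context from named, prioritised parts, drop the lowest-priority parts when the budget is exceeded, and report exactly what was dropped. The whitespace token count, part names, and priority scheme are illustrative assumptions; swap in your tokenizer and your framework's context-assembly hook.

```python
def assemble_context(parts, budget):
    """parts: list of (name, priority, text); lower priority drops first.
    Returns (kept_texts, dropped_names) so truncation is observable,
    not silent. Token counting is a crude whitespace split."""
    def tokens(text):
        return len(text.split())

    kept = list(parts)
    dropped = []
    # Drop whole low-priority parts until the remainder fits the budget.
    while kept and sum(tokens(t) for _, _, t in kept) > budget:
        victim = min(kept, key=lambda p: p[1])
        kept.remove(victim)
        dropped.append(victim[0])
    if dropped:
        # This is the metric no framework emits for you.
        print(f"context.truncated parts={dropped}")
    return [text for _, _, text in kept], dropped
```

With this in place, the 9,300-token example above would log which items were dropped instead of silently losing a user constraint and an API result.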


Failure Mode 3: Stale tool responses

Agents call tools (APIs, databases, search engines, code interpreters). Those tools can return stale data without any indication that the data is stale.

Production example

An investment research agent chain operates as follows:

  1. Agent A identifies securities to analyse
  2. Agent B retrieves current pricing data via a market data API
  3. Agent C performs valuation analysis and produces recommendations

The market data API has a caching layer. During a period of high volatility, the cache serves 15-minute-old prices for three securities. Agent B receives the prices, notes them as "current market data," and passes them to Agent C. Agent C's valuation analysis is based on prices that have moved 3-8% in the intervening 15 minutes.

What the logs show: Successful API calls with 200 responses. Valid JSON payloads. Normal latency.

What the logs don't show: That the last_updated field in the API response (buried three levels deep in the JSON) was 15 minutes old, not real-time.

Stale tool responses are particularly dangerous in multi-agent systems because:

  • The calling agent often doesn't inspect response metadata; it extracts the data fields it needs and discards the rest
  • Downstream agents have no visibility into the tool call at all, seeing only the extracted data without any provenance
  • Caching can happen at multiple layers (API gateway, CDN, application cache, framework-level response cache), and none of these layers coordinate with agent-level freshness requirements
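A freshness guard at the point of tool-call consumption addresses the first two bullets: check the response's timestamp against an agent-level staleness budget before the data is extracted and passed downstream. The nested `last_updated` path mirrors the market-data example; your API's field name, nesting, and timestamp format will differ.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

class StaleDataError(Exception):
    """Raised when a tool response is older than the agent's freshness budget."""

def assert_fresh(payload: dict, max_age: timedelta,
                 now: Optional[datetime] = None) -> datetime:
    """Reject cached or stale tool payloads before an agent consumes them.
    The metadata path below is an assumption for this example."""
    ts = datetime.fromisoformat(payload["meta"]["source"]["last_updated"])
    now = now or datetime.now(timezone.utc)
    age = now - ts
    if age > max_age:
        raise StaleDataError(f"tool data is {age} old, limit {max_age}")
    return ts
```

In the valuation example, Agent B would raise on the 15-minute-old prices instead of labelling them "current market data" for Agent C.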

Failure Mode 4: Delegation loops and constraint dilution

When agents can delegate tasks to other agents, the delegation chain can drift from the original intent. Each delegation step may slightly reframe the task, and constraints from the original request can be lost or weakened.

Production example

A customer-facing agent receives a request: "Find me flights from London to New York next Tuesday, but not via any US airlines. I have a loyalty programme dispute with them."

The agent delegates to a search agent: "Find flights LHR to JFK next Tuesday." The constraint about US airlines is treated as a preference, not a hard requirement, because the delegating agent summarised the request.

The search agent delegates to an API-calling agent: "Query flight API for LHR-JFK routes on [date]." The US airline constraint is now absent entirely.

The result set includes United and American Airlines flights. The customer-facing agent filters some of them out based on its memory of the constraint, but presents a Delta flight because Delta wasn't mentioned by name (the user said "US airlines" as a category).

What the logs show: Successful delegation. Successful API calls. Results returned. User received a response.

What the logs don't show: That the original constraint was diluted at each delegation step and imperfectly reconstructed at the end.

Constraint dilution in delegation chains is the multi-agent equivalent of the telephone game. It is especially dangerous when:

  • Delegation crosses model boundaries: The delegating agent uses GPT-4 but the sub-agent uses a smaller, less capable model that handles nuance differently
  • Delegation crosses trust boundaries: A sub-agent has different access permissions and bypasses controls that applied to the parent agent
  • Delegation is recursive: Agent A delegates to Agent B, which delegates to Agent C, which delegates to Agent D. By the time the result returns, three reframing steps have occurred
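One mitigation is to stop re-summarising constraints into prose at each hop: carry hard constraints as structured data that travels intact through every delegation, and check results against them before returning. A minimal sketch, assuming a predicate-per-constraint representation (the `Task` class and flight fields are illustrative):

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Task:
    instruction: str
    # Constraint name -> predicate over a result; hard requirements,
    # never paraphrased away during delegation.
    hard_constraints: Dict[str, Callable[[dict], bool]] = field(default_factory=dict)

    def delegate(self, new_instruction: str) -> "Task":
        # Reframe the instruction freely, but constraints travel intact.
        return Task(new_instruction, dict(self.hard_constraints))

    def violations(self, result: dict) -> List[str]:
        return [name for name, ok in self.hard_constraints.items()
                if not ok(result)]
```

In the flight example, the "no US airlines" predicate would still be attached at the API-calling hop and would flag the Delta result, because category membership is checked by code rather than reconstructed from a summary.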

Failure Mode 5: Semantic drift in long-running chains

In agent chains that process multiple items or operate over extended periods, the agents' interpretation of their task can drift subtly from the original specification.

Production example

A document processing pipeline reviews 200 insurance claims per day. The chain is:

  1. Agent A extracts claim details from submitted documents
  2. Agent B categorises the claim and identifies the applicable policy
  3. Agent C assesses the claim against policy terms and produces a recommendation

After three weeks of operation, the team notices that Agent C's approval rate has increased from 72% to 89%. No code changes were made. No model updates occurred.

Investigation reveals: Agent B's categorisation has been gradually shifting borderline claims into categories with more lenient policy terms. This isn't adversarial; it's the result of Agent B's few-shot examples being drawn from its own recent outputs (a feedback loop in the example selection logic). As it categorised more borderline claims as "standard," those categorisations became examples for future borderline claims.

What the logs show: Consistent throughput. Stable latency. No errors. Gradually improving "efficiency" (faster processing times as categorisation became more confident).

What the logs don't show: That categorisation accuracy degraded by 14% over three weeks, masked by increasing confidence scores.
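Catching this requires a frozen baseline to compare against. One lightweight approach is to track the distribution of an agent's outputs (here, Agent B's categories) and alert when it diverges from the baseline by more than a threshold. The sketch below uses total variation distance; the metric choice and the 0.1 threshold are illustrative assumptions.

```python
from collections import Counter
from typing import List

def total_variation(baseline: List[str], recent: List[str]) -> float:
    """Total variation distance between two categorical samples, in [0, 1]."""
    b, r = Counter(baseline), Counter(recent)
    categories = set(b) | set(r)
    return 0.5 * sum(abs(b[c] / len(baseline) - r[c] / len(recent))
                     for c in categories)

def drifted(baseline: List[str], recent: List[str],
            threshold: float = 0.1) -> bool:
    """Alert when recent outputs diverge from the frozen baseline."""
    return total_variation(baseline, recent) > threshold
```

A shift like the one in the example (72% of one category at baseline, 89% three weeks later) produces a distance of 0.17 and trips the alert, regardless of how confident the individual outputs look.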

The common thread: Every failure mode in this module produces clean telemetry. No errors, no timeouts, no validation failures. The system appears healthy by every conventional metric. The failure is in the quality and completeness of the reasoning process, and conventional monitoring doesn't measure that.


Blast radius and prioritisation

Not all failure modes are equally urgent to instrument. Here is a practical prioritisation framework:

  • Truncated retrieval: detection medium (you can count results); blast radius high (wrong decisions on complete-looking data); priority P0, instrument first
  • Context overflow: detection hard (truncation is often silent); blast radius high (safety constraints can be dropped); priority P0, instrument first
  • Stale tool responses: detection medium (check response timestamps); blast radius variable, depends on data sensitivity; priority P1, instrument second
  • Delegation loops: detection hard (requires tracking constraint propagation); blast radius medium, usually caught by end users; priority P2, instrument third
  • Semantic drift: detection very hard (requires baseline comparison over time); blast radius high but slow, damage accumulates gradually; priority P1, instrument second

The P0 items (truncated retrieval and context overflow) share a property: they can cause immediate, high-confidence wrong outputs. Instrument these before anything else.


What this means for your architecture

If you are running multi-agent systems in production today, you probably have:

  • Health checks that pass
  • Error rates near zero
  • Latency within SLAs
  • Logs that show successful completions

None of these metrics tell you whether your agents are reasoning on complete data. The rest of this track addresses that gap: Module 2 examines why your current observability tools miss these failures, Module 3 introduces the engineering concept you need (epistemic integrity), Module 4 gives you the control framework, and Module 5 shows you what to build.


Reflection

Think about a multi-agent system you operate or are building. Which of the five failure modes is most likely to occur in your specific architecture? What would the blast radius look like: who gets affected, how quickly, and how would you eventually discover the problem?

Consider

Start with your data flow. Where does each agent get its inputs? Which of those inputs could be silently truncated, stale, or incomplete? The failure mode most likely to bite you is the one attached to your most critical data dependency.


Next: Why Current Tools Miss It →