
Cross-Functional Exercise: The Atlas Pipeline

This exercise tests whether you can:

  • Apply the five-beat reasoning across security, governance, and engineering perspectives
  • Identify where role-specific concerns create productive tension
  • Collaborate (or simulate collaboration) to produce a coherent agent chain risk assessment
  • Recognise that AI runtime security is a cross-functional discipline

Background

Atlas Insurance is a mid-sized commercial insurer. They're deploying a multi-agent pipeline for complex claims processing, their highest-volume, highest-value operation.

The pipeline handles claims from initial submission through to payout decision:

Four-agent claims pipeline: Intake, Verify, Policy Check, Decision

  • Agent I (Intake) parses submitted claim documents, extracts key details (claimant, date of loss, amount, type, supporting evidence references).
  • Agent V (Verification) cross-references the evidence: checks photos against damage descriptions, verifies repair estimates against market rates, confirms dates against policy periods.
  • Agent P (Policy Check) evaluates the verified claim against policy terms: coverage limits, exclusions, deductibles, waiting periods.
  • Agent D (Decision) produces a recommendation: approve, deny, or escalate to a human adjuster.

Scale: 1,200 claims per day. Average claim value: $14,000. The pipeline currently operates at Tier 1 (human adjuster reviews every decision). Atlas wants to move to Tier 2 (human reviews only flagged/high-value claims) to reduce processing time from 5 days to 1 day.


Your task

You're part of a cross-functional team assessing whether the Atlas pipeline is ready for Tier 2 operation. Each role has a specific assessment to make:


Part 1: Security Architecture Assessment

If you completed the Security Architects track, this is your primary section. Otherwise, take this perspective.

Agent V (Verification) cross-references claim photos against damage descriptions. It uses a multimodal model to analyse photos and compare them against the written damage report.

Concern

During testing, the team discovered that when Agent V receives more than 8 photos for a single claim, it processes the first 8 and summarises the remainder based on the written description alone, without explicitly flagging that not all photos were independently verified. Its output says "photos verified against damage description" regardless of whether all photos were individually analysed.

Your assessment questions:

  1. Which MASO domain does this failure fall under?
  2. Where in the three-layer architecture would you place the control to catch this?
  3. What verification evidence would you need before approving Tier 2 operation?
  4. How does this interact with Agent P and Agent D's trust in Agent V's output?

Part 2: Governance & Risk Assessment

If you completed the Risk & Governance track, this is your primary section. Otherwise, take this perspective.

Atlas Insurance is regulated by the state insurance commissioner. They must demonstrate that claims decisions are "fair, timely, and based on a complete review of all submitted evidence."

Concern

At Tier 1, a human adjuster reviews every decision, so the "complete review" obligation is satisfied by the human. At Tier 2, the human only reviews flagged claims. For unflagged claims, the pipeline's decision goes directly to the claimant. The question is: does the pipeline's automated review satisfy the regulatory obligation for "complete review of all submitted evidence"?

Your assessment questions:

  1. What does "complete review" mean when an agent processed 8 of 12 photos?
  2. What governance framework would you apply to the Tier 1 → Tier 2 transition?
  3. What oversight evidence would you need to present to the insurance commissioner?
  4. How do you handle accountability when Agent D makes a deny decision based on Agent V's incomplete verification?

Part 3: Engineering Assessment

If you completed the Engineering Leads track, this is your primary section. Otherwise, take this perspective.

The engineering team needs to instrument the pipeline for Tier 2 operation. Currently, they have per-agent logging and basic latency monitoring.

Concern

Agent V's photo processing limit is a known constraint of the multimodal model: it can handle 8 images per context window. The engineering team has proposed three options: (a) batch photos into groups of 8 with a merge step, (b) upgrade to a model that handles more images, or (c) add a pre-processing step that selects the 8 most relevant photos. All three options have trade-offs in cost, latency, and accuracy.
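Option (a), batching with a merge step, can be sketched in a few lines. This is an illustrative sketch, not Atlas's implementation: the function names, the result fields, and the injected `analyse_batch` callable are all assumptions. The key design point is that the merged result records exactly which photos were analysed, so downstream agents can check coverage instead of trusting a blanket "photos verified" claim.

```python
from typing import Callable, Dict, List

MAX_IMAGES_PER_CALL = 8  # the multimodal model's per-context image limit


def verify_photos_batched(
    photos: List[str],
    analyse_batch: Callable[[List[str]], List[Dict]],
) -> Dict:
    """Option (a): split photos into groups of <=8, analyse each group
    with the multimodal model, then merge the per-photo findings.

    The returned record states photos_submitted vs photos_analysed
    explicitly, so "fully_verified" is a checkable fact, not prose.
    """
    findings: List[Dict] = []
    for start in range(0, len(photos), MAX_IMAGES_PER_CALL):
        batch = photos[start:start + MAX_IMAGES_PER_CALL]
        findings.extend(analyse_batch(batch))  # one model call per batch
    return {
        "photos_submitted": len(photos),
        "photos_analysed": len(findings),
        "fully_verified": len(findings) == len(photos),
        "findings": findings,
    }
```

For a 12-photo claim this makes two model calls (8 + 4) instead of silently dropping four photos; the trade-off is added latency and cost per extra batch, which is exactly what the runtime monitoring question below should surface.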

Your assessment questions:

  1. What instrumentation do you need to make Agent V's photo coverage visible to the chain?
  2. How would you implement a verification receipt for Agent V that downstream agents can check?
  3. Which of the three options (batch, upgrade, or select) would you recommend, and what are the runtime monitoring implications of each?
  4. What chain-level integrity test would you design to catch the "8 of 12 photos" failure?
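One way to approach question 2 is a signed verification receipt that Agent V attaches to its output and that Agents P and D check before trusting it. The sketch below uses an HMAC over a canonical JSON body; the field names, the shared-key scheme, and the coverage rule are illustrative assumptions, not part of the Atlas design.

```python
import hashlib
import hmac
import json

SHARED_KEY = b"rotate-me"  # illustrative; source from a key-management service


def issue_receipt(agent: str, claim_id: str, photos_submitted: int,
                  photos_analysed: int) -> dict:
    """Agent V attaches this to its output. It states what was actually
    analysed, not what the prose summary implies was analysed."""
    body = {
        "agent": agent,
        "claim_id": claim_id,
        "photos_submitted": photos_submitted,
        "photos_analysed": photos_analysed,
        "coverage_complete": photos_analysed == photos_submitted,
    }
    payload = json.dumps(body, sort_keys=True).encode()
    body["sig"] = hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()
    return body


def check_receipt(receipt: dict) -> bool:
    """Downstream agents (P, D) verify the signature and the coverage
    flag before relying on Agent V's 'photos verified' claim."""
    body = {k: v for k, v in receipt.items() if k != "sig"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, receipt["sig"]) and body["coverage_complete"]
```

Under this scheme, the "8 of 12 photos" claim produces a validly signed receipt with `coverage_complete: false`, so `check_receipt` fails and Agent D must escalate rather than decide. That also answers question 4: the chain-level integrity test is simply to run a known 12-photo claim through the pipeline and assert that it cannot reach an automated decision.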

Part 4: The Cross-Functional Decision

Now bring the three perspectives together. The team needs to make a single recommendation to Atlas Insurance's CTO:

Should Atlas move the claims pipeline from Tier 1 to Tier 2?

Option A: Approve Tier 2 immediately

The pipeline handles 1,200 claims/day well. Six months of Tier 1 data shows 99.2% agreement between the pipeline's recommendation and the human adjuster's final decision. The photo verification issue affects ~4% of claims (those with 9+ photos). The business case is strong.

Option B: Conditional approval

Approve Tier 2, but only for claims with 8 or fewer photos. Claims with 9+ photos remain at Tier 1 (human review). Set a 90-day review period with enhanced monitoring.

Option C: Delay until controls are in place

Don't approve Tier 2 until: (1) Agent V's photo processing limit is resolved, (2) chain-level integrity monitoring is deployed, (3) epistemic integrity verification receipts are implemented at every inter-agent boundary.

Option D: Phased rollout

Move to Tier 2 for the lowest-risk claim categories first (e.g., simple property claims under $5,000). Expand categories as controls are validated. Full Tier 2 in 6 months.


Think it through

For each option, consider:

  • Security: Does the control architecture support this level of automation?
  • Governance: Can you demonstrate regulatory compliance to the insurance commissioner?
  • Engineering: Can you build, instrument, and maintain the required controls?

Write your recommendation and your reasoning before revealing the analysis.

Analysis (click to reveal)

Option A (Approve immediately) ignores the photo verification gap. The 99.2% agreement rate during Tier 1 is misleading, as the human adjusters may have caught cases that the pipeline wouldn't have flagged at Tier 2. More critically, the "complete review" regulatory obligation isn't satisfied when 4% of claims have only partial photo verification.

Option B (Conditional approval) is operationally clean and directly addresses the known risk. Claims with 8 or fewer photos get Tier 2 automation; claims with more stay at Tier 1. This satisfies the regulatory requirement for those automated claims while keeping the business benefits for 96% of volume. The risk: you're now maintaining two operational modes, which increases engineering complexity.

Option C (Delay) is the most conservative and may be disproportionate. The business loses the processing-time reduction for all claims while waiting for full controls. If the controls take 3 months to implement, roughly 108,000 claims (90 days × 1,200 claims/day) would each wait an extra 4 days on average (5 days instead of 1), a significant claimant impact. Risk and governance teams need to weigh the risk of partial automation against the cost of delay, including the cost to claimants of slower processing.

Option D (Phased rollout) is the most nuanced. It lets the team validate controls at low risk before expanding. It provides real production evidence of Tier 2 operation. And it gives engineering time to resolve the photo processing limit while delivering business value. The trade-off: phased rollouts are operationally complex, and category-based thresholds can create edge cases.

A strong cross-functional answer would likely be Option B as the immediate step with Option D as the medium-term plan:

  1. Now: Conditional Tier 2 for claims with ≤8 photos. Engineering adds a photo count check to Agent I's output.
  2. 30 days: Deploy chain-level integrity monitoring and verification receipts.
  3. 60 days: Implement Agent V photo batching or model upgrade. Begin Tier 2 for higher-photo claims under enhanced monitoring.
  4. 90 days: Review data and decide on full Tier 2 expansion.
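The step-1 gate is deliberately small: a routing check on Agent I's output. The field name `photo_count` and the route labels below are illustrative assumptions; the essential behaviour is that over-limit or unknown-count claims fail closed to human review.

```python
def route_claim(intake_output: dict, tier2_photo_limit: int = 8) -> str:
    """Conditional Tier 2 gate on Agent I's output: claims within the
    verified photo limit proceed to automated decisioning; anything
    over the limit, or with no photo count at all, fails closed to a
    human adjuster (Tier 1)."""
    count = intake_output.get("photo_count")
    if count is None or count > tier2_photo_limit:
        return "tier1_human_review"
    return "tier2_automated"
```

The fail-closed default matters: a claim where Agent I could not determine the photo count is exactly the kind of claim that should not be automated.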

This approach satisfies all three perspectives:

  • Security: Controls are validated before they're relied upon
  • Governance: Regulatory obligation is met for every automated claim; evidence trail demonstrates progressive assurance
  • Engineering: Implementation is staged, allowing the team to validate each component before adding the next

The key insight: No single role's perspective produces this answer alone. The security architect would likely lean toward Option C (full controls first). The governance professional might lean toward Option B (clear regulatory boundary). The engineering lead might lean toward Option D (phased, data-driven). The combined answer is more robust than any individual one.


After the exercise

You've completed the learning programme. Here's what you should be taking away:

The mental model

Multi-agent AI systems create a new class of failure: correct-looking outputs from broken reasoning chains. This failure is invisible to conventional monitoring because it doesn't look like an error at any single point in the chain.

The core concept

Epistemic integrity means verifying that each agent's claims are warranted by the data it actually accessed. Not by the data it should have accessed. Not by the data its output implies it accessed. By the data it actually accessed.
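At its core, this is a set comparison between the sources an agent's output cites and the sources its runtime access log shows it actually touched. A minimal sketch (the function name and inputs are illustrative, not part of AIRS):

```python
def unwarranted_claims(claimed_sources: set, accessed_sources: set) -> set:
    """Epistemic integrity check: a claim is warranted only if every
    source it cites was actually accessed at runtime. Returns the
    cited-but-never-accessed sources; an empty set means the claim
    is warranted by the evidence actually examined."""
    return claimed_sources - accessed_sources
```

In the Atlas case, Agent V's output cites 12 photos while its access log shows 8, so the check returns the four unexamined photos and the "photos verified" claim is flagged as unwarranted.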

The framework

MASO provides eight control domains designed for multi-agent systems. The AIRS three-layer architecture (Guardrails, Model-as-Judge, Human Oversight) provides the runtime defence stack. Together, they address the gaps that conventional AI security controls miss.

The cross-functional imperative

AI runtime security is not a security team problem, a governance team problem, or an engineering team problem. It's a problem that lives in the gaps between those teams. The Phantom Compliance scenario and the Atlas Pipeline exercise demonstrate that the solutions must be equally cross-functional.


Continue learning

The AIRS Framework

The full framework: architecture, controls, implementation patterns, and worked examples.

View on GitHub →

Read the Framework Documentation

The complete AIRS documentation site with all MASO controls, stakeholder guides, and technical references.

Read the docs →

Explore Another Track

Did you only complete one learning track? Try another: the same material looks very different when read from another role's perspective.

Back to tracks →