Decision Exercise: Engineering Leads¶
This exercise tests whether you can:
- Interpret ambiguous production signals from an agent chain using integrity metrics
- Distinguish between operational anomalies and epistemic integrity failures
- Make a defensible engineering decision under uncertainty with incomplete data
- Apply the PACE resilience framework to a real-time incident
The situation¶
You are the engineering lead for Crestline Logistics, an AI-powered supply chain optimisation platform. Your team operates a four-agent pipeline that processes procurement decisions:
- Agent S (Sourcing): Retrieves supplier data, pricing, and availability from internal databases and external APIs
- Agent R (Risk): Evaluates supply chain risk by checking geopolitical data, supplier financial health, and compliance status
- Agent O (Optimisation): Balances cost, risk, and delivery time to recommend a procurement plan
- Agent A (Approval): Validates the recommendation against company policy and spending thresholds, then routes for approval
The system processes approximately 400 procurement decisions per day. Average chain latency is 8 seconds. You deployed MASO controls three weeks ago, including verification receipts, retrieval completeness monitoring, and a chain integrity dashboard.
It is Tuesday at 10:47. You are reviewing the morning's dashboard when you notice something.
The signals¶
Signal 1: Retrieval completeness anomaly¶
Your chain integrity dashboard shows that Agent S's retrieval completeness ratio has dropped.
The drop is gradual, not sudden. It started around 08:30 and has been declining steadily. The warning threshold of 0.90 was crossed at approximately 09:45. No page fired because your alerting rule triggers only when the metric stays below 0.80 for 15 consecutive minutes, and the value has been hovering just above that, around 0.80-0.82, for the last hour.
Retrieval completeness is below your warning threshold but above your critical threshold (0.50). The decline is gradual, suggesting a progressive issue rather than a hard failure. Agent S is still retrieving data, just less of it than usual. Approximately 50 procurement decisions have been processed since the metric crossed the warning threshold.
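The thresholds and dwell-time rule described above can be sketched as a small monitor. This is illustrative, not the platform's actual API; the 0.80 paging trigger is an assumption that would explain why no page fired this morning:

```python
WARNING = 0.90        # dashboard warning threshold
CRITICAL = 0.50       # dashboard critical threshold
PAGE_TRIGGER = 0.80   # assumed paging trigger, below the dashboard warning
DWELL = 15 * 60       # seconds of sustained breach before a page fires


class CompletenessMonitor:
    """Tracks retrieval completeness; pages only after a sustained breach."""

    def __init__(self):
        self.below_since = None  # when the metric first dipped below the trigger

    def observe(self, expected: int, actual: int, now: float):
        """Return None, "warning", or "page" for one retrieval observation."""
        ratio = actual / expected if expected else 0.0
        if ratio >= PAGE_TRIGGER:
            self.below_since = None            # recovery resets the dwell timer
            return "warning" if ratio < WARNING else None
        if self.below_since is None:
            self.below_since = now
        if ratio < CRITICAL or now - self.below_since >= DWELL:
            return "page"
        return "warning"
```

A value hovering just above the trigger, as in this scenario, resets the dwell timer indefinitely and never pages. That is the gap between "the dashboard shows a warning" and "someone gets told".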
Signal 2: PACE alternate path usage¶
Your PACE dashboard shows alternate path activations have increased:
PACE Path Distribution (last 4 hours)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
07:00-08:00 Primary: 98% Alternate: 2% Contingency: 0%
08:00-09:00 Primary: 95% Alternate: 5% Contingency: 0%
09:00-10:00 Primary: 87% Alternate: 12% Contingency: 1%
10:00-10:47 Primary: 82% Alternate: 16% Contingency: 2%
PACE alternate activations have increased 8x from baseline (2% to 16%). Contingency path activations (human escalation) have appeared for the first time this week.
The PACE system is working: it is catching integrity failures and routing to alternate paths. Most alternate paths are succeeding (the retry with forced full retrieval is recovering the missing data). But the increasing trend means the underlying issue is getting worse, and the alternate path adds 3-5 seconds of latency per request. Two requests have gone to contingency (human escalation) in the last hour.
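The graduated routing behind those numbers can be sketched in a few lines. The function names and the 0.90 trigger threshold are assumptions for illustration:

```python
def route(completeness: float, primary, alternate, contingency):
    """PACE fallback in miniature: run the primary path when integrity looks
    good; on a completeness shortfall, retry with forced full retrieval
    (alternate, at the cost of extra latency); escalate to a human
    (contingency) only if the retry cannot recover the missing data."""
    if completeness >= 0.90:                   # assumed trigger threshold
        return "primary", primary()
    recovered = alternate()                    # retry with forced full retrieval
    if recovered is not None:
        return "alternate", recovered
    return "contingency", contingency()        # human escalation
```

The shifting distribution in the table is exactly what this logic produces as completeness declines: more calls fall through the first branch, and the occasional unrecoverable retry reaches a human.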
Signal 3: Cross-agent consistency¶
Your cross-agent consistency metric between Agent R (Risk) and Agent O (Optimisation) shows a subtle anomaly:
Cross-Agent Consistency: Agent R → Agent O
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Baseline (30-day avg): 0.91
Last hour avg: 0.84
Last 15 min avg: 0.78
Lowest single chain: 0.52 (chain-28491, 10:31)
Agent O is occasionally producing optimisation recommendations with higher confidence than Agent R's risk assessment warrants. In chain-28491, Agent R assessed a supplier with moderate risk (confidence 0.65, based on limited geopolitical data) and Agent O recommended that supplier as the primary choice (confidence 0.92).
The consistency metric is declining but not yet at critical levels for most chains. One chain (28491) had notably poor consistency. The pattern suggests that when Agent S's retrieval is incomplete, Agent R's risk assessment is less certain, but Agent O doesn't sufficiently discount its recommendations based on that uncertainty. This is a confidence propagation issue, exactly the pattern from Module 1.
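One plausible way to score R-to-O consistency is to penalise confidence inflation directly. The course does not give the dashboard's exact formula, so the weighting here is an assumption:

```python
def consistency(upstream_conf: float, downstream_conf: float) -> float:
    """Penalise confidence inflation: a downstream agent that is *more*
    confident than its upstream input warrants loses points; being less
    confident than upstream costs nothing. 1.0 means fully consistent."""
    inflation = max(0.0, downstream_conf - upstream_conf)
    return max(0.0, 1.0 - 2.0 * inflation)   # assumed double weighting
```

With chain-28491's values (Agent R at 0.65, Agent O at 0.92) this yields roughly 0.46, in the same neighbourhood as the 0.52 on the dashboard, though the production formula evidently weights things differently.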
Signal 4: The log trail¶
You pull the logs for chain-28491, the worst-performing chain:
{"timestamp": "10:31:02", "agent": "agent-s", "action": "supplier_retrieval",
"source": "geopolitical_risk_db", "expected": 156, "actual": 118,
"completeness": 0.76, "status": "success"}
{"timestamp": "10:31:04", "agent": "agent-r", "action": "risk_assessment",
"supplier": "Tangshan Heavy Industries", "risk_score": 0.35,
"confidence": 0.65, "note": "Limited geopolitical data available for region",
"receipt_integrity": true, "status": "success"}
{"timestamp": "10:31:06", "agent": "agent-o", "action": "optimisation",
"recommended_supplier": "Tangshan Heavy Industries",
"cost_saving": "12%", "confidence": 0.92,
"receipt_integrity": true, "status": "success"}
{"timestamp": "10:31:07", "agent": "agent-a", "action": "approval",
"decision": "APPROVED", "spending_threshold": "within_limits",
"policy_check": "passed", "status": "success"}
The logs tell a story: Agent S retrieved 76% of expected geopolitical data. Agent R correctly noted limited data and reduced its confidence to 0.65. But Agent O ignored the low upstream confidence and produced a 0.92 confidence recommendation, driven primarily by the 12% cost saving. Agent A approved because the recommendation met spending thresholds and policy checks. The receipt integrity checks passed for each individual agent (each agent's data was internally consistent), but the cross-agent consistency check flagged the R-to-O confidence gap.
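Finding other chains like 28491 can start with a simple scan over the structured logs. Field names follow the excerpt above; the 0.2 gap threshold is an assumption:

```python
def flag_confidence_inflation(entries, max_gap=0.2):
    """Return the R-to-O confidence gap for one chain's log entries when the
    optimiser's confidence exceeds the risk agent's by more than max_gap,
    else None. Field names follow the chain-28491 log format."""
    conf = {e["agent"]: e["confidence"] for e in entries if "confidence" in e}
    if "agent-r" in conf and "agent-o" in conf:
        gap = conf["agent-o"] - conf["agent-r"]
        if gap > max_gap:
            return round(gap, 2)
    return None
```

Run over every chain since 08:30, this turns "audit the 50 decisions" from a manual log-read into a ranked shortlist.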
Your decision¶
You need to decide what to do right now. The pipeline is still running. Procurement decisions are still being processed and approved. The trend lines are all moving in the wrong direction.
Option A: Monitor and wait¶
The PACE system is catching most integrity failures. The retrieval completeness is still at 0.81, well above the critical threshold of 0.50. Let the automated controls handle it while you investigate the root cause of the declining retrieval completeness. No operational changes.
Option B: Tighten thresholds¶
Immediately raise the PACE trigger thresholds so more chains route to the alternate path. Raise the retrieval completeness warning threshold from 0.90 to 0.95 and the critical threshold from 0.50 to 0.70. This will increase alternate path usage and catch more borderline cases, at the cost of higher latency and more human escalations.
Option C: Pause and investigate¶
Halt the pipeline for new procurement decisions. Existing in-flight chains complete but no new ones start. Investigate the retrieval completeness issue. Audit the 50 decisions made since the metric crossed the warning threshold. Resume when the root cause is identified and fixed.
Option D: Partial pause with targeted intervention¶
Keep the pipeline running for standard procurement (low-value, low-risk). Halt processing for high-value or high-risk procurement decisions (route to manual processing). Investigate the retrieval completeness decline in parallel. Audit chain-28491 and similar high-confidence-gap chains specifically.
Think it through¶
Before reading the analysis, make your choice. Write down:
- What you would do
- Why
- What evidence would change your mind
- What the worst-case outcome is if you're wrong
The exercise is more valuable if you commit before seeing the discussion.
Analysis (click to reveal after you've decided)
There is no single correct answer. Each option makes different trade-offs between operational continuity and risk management. Here's the engineering analysis:
Option A (Monitor and wait) trusts your automated controls. The PACE system is catching most failures and routing to alternate paths. This is defensible if you believe the retrieval completeness issue is transient (a slow database, a temporary API issue) and will resolve. The risk: the trend is worsening. If retrieval completeness continues to decline, more chains will hit the alternate path, then the contingency path, and eventually the emergency path. You also have 50 decisions already processed at below-warning quality, including chain-28491 where confidence inflation bypassed the per-agent integrity checks.
Option B (Tighten thresholds) is a precision response. You're not stopping the system; you're making the controls more sensitive. This is a good instinct, but it has a timing problem: changing thresholds in real-time during an active anomaly means you're tuning your controls while the conditions are abnormal. You risk setting thresholds that are appropriate for this anomaly but too aggressive for normal operation. A better approach is to tighten thresholds temporarily (with a scheduled revert) or to add the cross-agent consistency check as a hard gate rather than just a metric.
Option C (Pause and investigate) is the most conservative option. It is appropriate if you believe the 50 decisions made since the threshold crossing may include materially wrong approvals. The cost is operational: procurement decisions are delayed, which has business impact. The benefit is certainty: no more potentially flawed decisions while you investigate.
Option D (Partial pause with targeted intervention) is the most operationally nuanced option. It recognises that not all procurement decisions carry the same risk. Low-value, standard procurement can tolerate slightly lower data quality because the blast radius is small. High-value and high-risk procurement needs full data quality because the blast radius is large. This option lets you maintain partial operations while protecting against the worst outcomes.
A strong engineering response would be Option D, with these specific actions:
- Immediate: Route high-value (>$50K) and high-risk (geopolitical, sole-source) procurement to manual processing. Keep standard procurement running with PACE controls active.
- Immediate: Add the cross-agent consistency score as a hard gate (not just a metric) at the R-to-O boundary. Any chain with consistency below 0.70 routes to contingency (human review) regardless of individual receipt integrity.
- Within 1 hour: Identify the root cause of the retrieval completeness decline. Check the geopolitical risk database for: query performance degradation, index issues, API rate limiting, data source availability.
- Within 2 hours: Audit chain-28491 and all other chains where cross-agent consistency was below 0.80 since 08:30. Flag any procurement decisions that may need review.
- After resolution: Conduct a post-incident review. Consider whether the cross-agent consistency check should be a permanent hard gate rather than an advisory metric.
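The hard gate at the R-to-O boundary amounts to a small guard. The function name and return values are illustrative:

```python
CONSISTENCY_GATE = 0.70

def r_to_o_boundary(consistency_score: float, receipt_ok: bool) -> str:
    """Hard gate at the R-to-O boundary: low cross-agent consistency routes
    the chain to human review even when every per-agent receipt passed."""
    if not receipt_ok:
        return "contingency"      # existing per-agent check still applies
    if consistency_score < CONSISTENCY_GATE:
        return "contingency"      # new: the chain-level check is now blocking
    return "proceed"
```

Chain-28491 (consistency 0.52, receipts all passing) would have hit the second branch and gone to a human instead of to Agent A.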
What this exercises: The ability to use chain-level integrity metrics (not just per-agent metrics) to make operational decisions under uncertainty. The cross-agent consistency signal was the critical clue: it showed that even when individual agents passed their integrity checks, the chain as a whole was producing inflated confidence. This is the "vertical dimension" from the MASO framework in action.
Key takeaway: The Phantom Compliance scenario showed what happens with zero integrity instrumentation (15 days to detection). This exercise shows what happens with good instrumentation that catches most issues automatically (PACE) but surfaces edge cases that require engineering judgement. The instrumentation didn't eliminate the need for human decisions; it made those decisions informed, timely, and specific.
After the exercise¶
You have completed the Engineering Leads track. Here is the consolidated picture:
The golden thread¶
- What breaks (Module 1): Multi-agent systems fail silently. Truncated retrievals, context overflow, stale data, delegation drift, and semantic drift all produce clean telemetry while delivering wrong results.
- Why tools miss it (Module 2): Current observability stacks, including AI-specific tools like LangSmith and LangFuse, measure operational health, not reasoning integrity. They tell you the system is running, not that it's reasoning correctly.
- Epistemic integrity (Module 3): The missing engineering requirement. Track what data each agent accessed, compute completeness ratios and confidence gaps, and propagate this metadata through verification receipts.
- MASO controls (Module 4): The framework for building runtime controls. Circuit breakers that trip on integrity signals. PACE resilience patterns for graduated fallback. Boundary contracts for inter-agent data flow. Integration patterns for LangGraph, AutoGen, and CrewAI.
- Instrumentation (Module 5): The specific metrics, dashboards, and tests. Retrieval completeness ratio, context utilisation, tool freshness, cross-agent consistency. Canary tests, chaos tests, chain integrity tests. An evidence pipeline that proves your controls work.
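The receipt-propagation idea running through Modules 3-5 can be condensed into a small structure. Field names here are assumptions for illustration, not the MASO reference implementation:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerificationReceipt:
    """Illustrative receipt an agent attaches to its output so that
    downstream agents can see what data it saw and how sure it is."""
    agent: str
    completeness: float      # actual / expected records retrieved
    confidence: float        # this agent's confidence in its own output
    upstream: tuple = ()     # receipts from the agents that fed this one

    def min_upstream_confidence(self) -> float:
        """Weakest-link confidence in the chain so far, including this agent."""
        confs = [self.confidence]
        confs += [r.min_upstream_confidence() for r in self.upstream]
        return min(confs)
```

Built for the S-to-R-to-O chain in this exercise, the weakest-link value (0.65) survives to Agent O's receipt, which is exactly the number Agent O should have discounted against before emitting 0.92.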
What you build next¶
- Add retrieval completeness metrics to your most critical data sources
- Implement verification receipts at your most critical agent boundaries
- Build a chain integrity dashboard with the six core metrics
- Write canary tests for the top three failure modes in your system
- Establish a weekly evidence report that covers deployment, activity, effectiveness, and coverage
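A canary test for the first failure mode on that list (truncated retrieval) can be very small. Here `run_chain` is a stand-in for your real pipeline entry point, with an assumed interface:

```python
def run_chain(expected: int, actual: int) -> dict:
    """Stand-in for the real pipeline: returns the chain's integrity verdict.
    The interface is assumed for illustration."""
    completeness = actual / expected
    return {"completeness": completeness, "flagged": completeness < 0.90}


def canary_truncated_retrieval():
    """Canary: inject a known retrieval shortfall and assert the chain flags
    it instead of reporting clean success."""
    result = run_chain(expected=156, actual=118)
    assert result["flagged"], "truncated retrieval slipped through unflagged"
    return result
```

Run on a schedule against a staging pipeline, this is the test that would have failed weeks before a dashboard operator happened to notice a declining ratio.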
Final reflection¶
You started this track with a question: "What do I actually build?" Look back at your notes from each module. What is the single most important thing you learned that you didn't know before? What will you build first?