PACE Resilience Methodology

Primary. Alternate. Contingency. Emergency.

The structured approach to ensuring no single failure in your AI security controls is catastrophic.

This document uses the simplified three-tier system (Tier 1/2/3). See Risk Tiers — Simplified Tier Mapping for the mapping to LOW/MEDIUM/HIGH/CRITICAL.

What PACE Is

PACE (Primary, Alternate, Contingency, Emergency) is a resilience methodology that originated in military communications planning. It keeps mission-critical functions running when the preferred method fails by pre-defining four layers of redundancy, each in a different failure domain, so that a single event can't cascade through all of them.

This framework adopts PACE as a core design principle, not an optional overlay. Every control, at every tier, must have a defined fail posture and fallback path before the system enters production.

The Two-Axis PACE Model

PACE applies to AI security controls on two axes simultaneously:

[Figure: Two-Axis PACE Model]

Horizontal PACE operates across the three control layers. If Guardrails fail, the LLM-as-Judge becomes the primary defence. If the Judge fails, Human Oversight absorbs more. If Human Oversight is overwhelmed, the Circuit Breaker activates and routes to a non-AI path. This is the architecture-level resilience plan.
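The horizontal hand-off can be sketched in a few lines of Python. This is a minimal illustration, not part of the framework: the names `Layer`, `HORIZONTAL_ORDER`, and `primary_defence` are hypothetical, and it assumes the simplified reading that responsibility passes to the first layer that is still healthy, with the circuit breaker as the last resort.

```python
from enum import Enum


class Layer(Enum):
    GUARDRAILS = "guardrails"
    JUDGE = "llm_as_judge"
    HUMAN_OVERSIGHT = "human_oversight"
    CIRCUIT_BREAKER = "circuit_breaker"


# Order in which responsibility passes along the horizontal axis
# (assumption: layers are consulted in this fixed order).
HORIZONTAL_ORDER = (Layer.GUARDRAILS, Layer.JUDGE, Layer.HUMAN_OVERSIGHT)


def primary_defence(healthy: set) -> Layer:
    """Return the first healthy control layer, or the circuit breaker if none remain."""
    for layer in HORIZONTAL_ORDER:
        if layer in healthy:
            return layer
    # No AI control layer is available: route to the non-AI path.
    return Layer.CIRCUIT_BREAKER
```

With guardrails down, `primary_defence({Layer.JUDGE, Layer.HUMAN_OVERSIGHT})` returns `Layer.JUDGE`; with every layer unavailable it returns `Layer.CIRCUIT_BREAKER`.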

Vertical PACE operates within each control layer. If the guardrail engine is slow, it falls back to a stricter, simpler rule set. If it's down entirely, the system adopts its configured fail posture (open or closed, depending on tier). If it's compromised, the system isolates and escalates. This is the component-level resilience plan.
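As a rough sketch of the vertical axis inside the guardrail layer, the function below maps an observed health state to a response. The `Health` states and the labelling of each response as primary, alternate, contingency, or emergency are assumptions made for illustration; the framework's own per-tier tables live in the Controls document.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    SLOW = "slow"
    DOWN = "down"
    COMPROMISED = "compromised"


def guardrail_response(health: Health, fail_posture: str) -> str:
    """Pick the guardrail layer's response for a given health state.

    fail_posture is the tier's configured posture: "open" or "closed".
    """
    if fail_posture not in ("open", "closed"):
        raise ValueError(f"fail_posture must be 'open' or 'closed', got {fail_posture!r}")
    if health is Health.HEALTHY:
        return "primary: full rule set"
    if health is Health.SLOW:
        return "alternate: stricter, simpler rule set"
    if health is Health.DOWN:
        # The engine is unavailable, so the configured fail posture applies.
        return f"contingency: fail-{fail_posture}"
    # Compromised: do not trust the engine at all.
    return "emergency: isolate and escalate"
```

The fail posture is validated up front so that a misconfigured tier is caught even while the engine is healthy, rather than at the moment it goes down.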

Both axes must be defined before deployment. The risk tier determines how formally each is documented, tested, and maintained.

The PACE Principle: Independent Failure Domains

The critical rule that makes PACE effective: each layer must depend on a different mechanism, so a single failure mode can't take everything down simultaneously.

Layer	Mechanism Type	Dependency	Why It's Independent
Guardrails	Deterministic rules engine	Pattern database, API gateway	Different from probabilistic model inference
LLM-as-Judge	Probabilistic model inference	Separate LLM, evaluation prompts	Different model, different evaluation criteria
Human Oversight	Cognitive judgment	Trained personnel, review interface	Not affected by model or engine failures
Circuit Breaker	Infrastructure control	Network routing, feature flags	Operates at infrastructure layer, independent of AI stack

A prompt injection that bypasses guardrails won't automatically fool the Judge. A Judge model failure doesn't impair guardrails or human review. A staffing gap in human oversight doesn't affect automated controls. This independence is the foundation of resilience.
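The independence rule can be checked mechanically once each layer's dependencies are written down. The sketch below uses the dependencies from the table above; the function name `shared_dependencies` is hypothetical, and a real inventory would list concrete components (hosts, providers, credentials) rather than these category labels.

```python
from itertools import combinations

# Dependencies per layer, taken from the table above.
LAYER_DEPENDENCIES = {
    "Guardrails": {"pattern database", "API gateway"},
    "LLM-as-Judge": {"separate LLM", "evaluation prompts"},
    "Human Oversight": {"trained personnel", "review interface"},
    "Circuit Breaker": {"network routing", "feature flags"},
}


def shared_dependencies(deps: dict) -> dict:
    """Return {(layer_a, layer_b): shared_items} for every pair of layers that overlap."""
    overlaps = {}
    for (name_a, deps_a), (name_b, deps_b) in combinations(deps.items(), 2):
        common = deps_a & deps_b
        if common:
            overlaps[(name_a, name_b)] = common
    return overlaps
```

An empty result means no two layers share a listed dependency. If the Judge were routed through the same API gateway as the guardrails, the pair would be reported and the shared gateway would count as a single failure domain.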

The Core Architect Decision: Fail Posture

When a control layer degrades, it must do one of two things:

Fail-open: Allow traffic to pass, rely on remaining layers to catch problems. Accepts risk of uncontrolled output in exchange for service continuity.
Fail-closed: Block all AI traffic through the degraded layer. Accepts service disruption in exchange for safety.

This decision is the single most important resilience choice an architect makes for each control at each tier.

[Figure: Fail Posture Decision Tree]

The rule of thumb:

Tier 1 (internal, no customer impact): Fail-open is acceptable. Log everything. Fix next business day.
Tier 2 (customer-facing, human-reviewed): Fail-closed by default. Automated switchover to fallback.
Tier 3 (regulated, autonomous): Fail-closed always. No AI traffic passes a degraded control. Immediate escalation.
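The rule of thumb can be written as a small lookup so that the posture is decided at design time rather than during an incident. This is a sketch under assumptions: `FailPosture`, `RULE_OF_THUMB`, and `posture_for` are illustrative names, and the table encodes only the defaults stated above, which an architect may override per control.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailPosture:
    allow_ai_traffic: bool  # True = fail-open, False = fail-closed
    follow_up: str          # what happens after the control degrades


# Default posture per tier, as stated in the rule of thumb.
RULE_OF_THUMB = {
    1: FailPosture(True, "log everything; fix next business day"),
    2: FailPosture(False, "automated switchover to fallback"),
    3: FailPosture(False, "immediate escalation"),
}


def posture_for(tier: int) -> FailPosture:
    """Return the default fail posture for a risk tier (1, 2, or 3)."""
    try:
        return RULE_OF_THUMB[tier]
    except KeyError:
        raise ValueError(f"unknown tier: {tier}") from None
```

An unknown tier raises an error instead of falling back to a default, so a misclassified system cannot silently inherit a fail-open posture.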

Where PACE Requirements Live in This Framework

PACE is integrated throughout the framework, not isolated in a single document:

Framework Document	What It Contains
Risk Tiers	Resilience requirements per tier — PACE formality, fail posture defaults, testing frequencies, fallback path requirements
Controls	Internal PACE tables for each control layer — what Primary/Alternate/Contingency/Emergency looks like within Guardrails, Judge, and Human Oversight at each tier
Agentic	Graceful degradation path for autonomous systems — the five-phase transition from full autonomy to full stop, with transaction resolution and multi-agent cascade prevention
Checklist	PACE verification items — what to check before go-live and at each testing interval

This document is the methodology reference. The requirements are in the documents above.

Why PACE, Not Just "Defence in Depth"

Defence in depth tells you to have multiple layers. It doesn't tell you:

What happens when a layer fails (not just "the next layer catches it" — what specifically changes?)
What the trigger is to transition from one operational state to another
How to step back up to normal operation after an incident
What to do when "turn it off" isn't simple (mid-transaction agents, multi-agent cascades)
How to test whether your fallback actually works

PACE answers all five. That's why it's a design principle, not an afterthought.
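One way to make the triggers and the step-back path explicit is to record transitions as data and check the plan for gaps. The sketch below is illustrative only: the `Transition` record, the example trigger strings, and the function `states_without_step_up` are hypothetical and do not come from the framework.

```python
from dataclasses import dataclass

STATES = ["Primary", "Alternate", "Contingency", "Emergency"]


@dataclass(frozen=True)
class Transition:
    source: str
    target: str
    trigger: str  # observable condition that causes the transition


def states_without_step_up(transitions: list) -> list:
    """Return degraded states that have no defined transition back towards Primary."""
    rank = {state: i for i, state in enumerate(STATES)}
    has_step_up = {
        t.source for t in transitions if rank[t.target] < rank[t.source]
    }
    # Primary is excluded: there is nothing to step up to.
    return [state for state in STATES[1:] if state not in has_step_up]


plan = [
    Transition("Primary", "Alternate", "guardrail latency above threshold (example)"),
    Transition("Alternate", "Primary", "guardrail healthy and verified (example)"),
    Transition("Alternate", "Contingency", "judge unavailable (example)"),
]
print(states_without_step_up(plan))
```

For this example plan the check reports Contingency and Emergency, because neither has a defined way back to normal operation.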

PACE for Multi-Agent Systems

The two-axis model above applies to single-model deployments. When multiple agents collaborate across trust boundaries, PACE extends to a third axis: agent orchestration resilience.

Multi-agent systems fail differently. A single agent producing a bad output is a local problem. An agent chain where one compromised agent's output becomes another agent's trusted input is a systemic problem. PACE for multi-agent systems must handle cascading failures, not just component failures.

The MASO Framework integrates PACE at the orchestration level:

PACE Phase	Multi-Agent Response
Primary	All agents operational. Full autonomy within tier permissions. Message bus integrity verified. Judge evaluates cross-agent outputs.
Alternate	One agent anomalous. Isolate that agent. Activate backup (potentially from a different provider). Tighten tool permissions to read-only across the chain. All write operations require human approval.
Contingency	Multiple agents degraded or correlated failure detected. Suspend multi-agent orchestration. Human approves every action. Reduced capacity, high assurance.
Emergency	Confirmed compromise or cascading failure. Kill switch fires. All agent sessions terminated. Non-AI fallback activated. Incident response engaged.
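The phase selection in the table above can be sketched as a single function. The inputs (`anomalous_agents`, `correlated_failure`, `confirmed_compromise`, `cascading_failure`) are hypothetical signals assumed to come from monitoring; how they are detected is outside this sketch.

```python
def maso_phase(
    anomalous_agents: int,
    correlated_failure: bool = False,
    confirmed_compromise: bool = False,
    cascading_failure: bool = False,
) -> str:
    """Map observed multi-agent conditions to a PACE phase, most severe first."""
    if anomalous_agents < 0:
        raise ValueError("anomalous_agents cannot be negative")
    if confirmed_compromise or cascading_failure:
        return "Emergency"    # kill switch, terminate sessions, non-AI fallback
    if anomalous_agents > 1 or correlated_failure:
        return "Contingency"  # suspend orchestration, human approves every action
    if anomalous_agents == 1:
        return "Alternate"    # isolate the agent, read-only tools, human-approved writes
    return "Primary"          # all agents operational
```

Checking the most severe condition first means that a confirmed compromise always selects Emergency, regardless of how many agents look anomalous.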

Key differences from single-agent PACE:

Blast radius is wider. A compromised agent can inject instructions into the message bus that affect every downstream agent. Containment must isolate the agent and quarantine its recent outputs across the chain.

Transitions are automated at higher tiers. At MASO Tier 2+, the monitoring agent or orchestrator can initiate P→A transitions without waiting for human approval, because multi-agent cascading failures move faster than humans can respond. Humans are notified of the transition but do not gate it.

Recovery requires chain verification. Stepping back from C→A or A→P isn't just "restart the failed component." The system must verify that no poisoned data from the compromised agent persists in other agents' memory, context, or RAG corpus.
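A recovery gate along these lines might look like the sketch below. It assumes every stored item carries a provenance tag naming the agent that produced it, which the framework does not specify; the store names, the `source_agent` and `id` fields, and both function names are hypothetical.

```python
# Stores named in the text: memory, context, and RAG corpus.
STORE_NAMES = ("memory", "context", "rag_corpus")


def find_poisoned_items(agent_stores: dict, compromised_agent: str) -> list:
    """Return (agent, store, item_id) for every item produced by the compromised agent.

    agent_stores maps agent name -> store name -> list of items, where each item is a
    dict with "id" and "source_agent" keys (an assumed provenance tag).
    """
    findings = []
    for agent, stores in agent_stores.items():
        for store_name in STORE_NAMES:
            for item in stores.get(store_name, []):
                if item["source_agent"] == compromised_agent:
                    findings.append((agent, store_name, item["id"]))
    return findings


def may_step_up(agent_stores: dict, compromised_agent: str) -> bool:
    """Allow C→A or A→P only when no data from the compromised agent persists."""
    return not find_poisoned_items(agent_stores, compromised_agent)
```

This direct check only finds items tagged with the compromised agent itself. Data that other agents derived from those items would need lineage tracking, which is not shown here.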

→ MASO PACE Implementation · Tier 1 · Tier 2 · Tier 3

Testing Your PACE Plan

A PACE plan that hasn't been tested is a plan that won't work.

Test Type	Tier 1	Tier 2	Tier 3
Guardrail bypass / failure simulation	Annually	Quarterly	Monthly
Judge failure simulation	Annually	Quarterly	Monthly
Human escalation exercise	Annually	Quarterly	Quarterly
Circuit breaker activation	Annually	Quarterly	Monthly
Full degradation walkthrough	—	Semi-annually	Quarterly
Non-AI fallback operation	Annually	Quarterly	Monthly
Recovery (step-back-up) validation	Annually	Quarterly	Monthly

For Tier 3 systems, testing should involve the same personnel who would handle a real incident, using the same tools and communication channels.
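The schedule above can be tracked with a simple overdue check. In this sketch the frequencies are copied from the table, while the interval lengths in days and the names `INTERVAL_DAYS`, `TEST_FREQUENCY`, and `overdue_tests` are assumptions made for illustration.

```python
from datetime import date, timedelta

# Approximate interval lengths in days (an assumption; the table gives only names).
INTERVAL_DAYS = {"Monthly": 31, "Quarterly": 92, "Semi-annually": 183, "Annually": 366}

# Test type -> {tier: frequency}; None means the test is not required at that tier.
TEST_FREQUENCY = {
    "Guardrail bypass / failure simulation": {1: "Annually", 2: "Quarterly", 3: "Monthly"},
    "Judge failure simulation": {1: "Annually", 2: "Quarterly", 3: "Monthly"},
    "Human escalation exercise": {1: "Annually", 2: "Quarterly", 3: "Quarterly"},
    "Circuit breaker activation": {1: "Annually", 2: "Quarterly", 3: "Monthly"},
    "Full degradation walkthrough": {1: None, 2: "Semi-annually", 3: "Quarterly"},
    "Non-AI fallback operation": {1: "Annually", 2: "Quarterly", 3: "Monthly"},
    "Recovery (step-back-up) validation": {1: "Annually", 2: "Quarterly", 3: "Monthly"},
}


def overdue_tests(tier: int, last_run: dict, today: date) -> list:
    """Return test types that were never run or whose interval has elapsed."""
    overdue = []
    for test, per_tier in TEST_FREQUENCY.items():
        frequency = per_tier[tier]
        if frequency is None:
            continue  # not required at this tier
        last = last_run.get(test)
        if last is None or today - last > timedelta(days=INTERVAL_DAYS[frequency]):
            overdue.append(test)
    return overdue
```

A test that has never been run counts as overdue, so a new system starts with every required test outstanding.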

Document	Relevance
Controls	The three-layer architecture with integrated PACE resilience
Agentic	Graceful degradation for autonomous agents
Risk Tiers	Classification that determines PACE requirements
MASO Framework	PACE extended to multi-agent orchestration — agent failover, cascading failure response, kill switch architecture
Incident Playbook	Templates for Emergency layer activation
Infrastructure Beats Instructions	Why the Emergency layer must be infrastructure, not prompts

AI Runtime Behaviour Security, 2026 (Jonathan Gill).