Agentic AI: Graceful Degradation and PACE Resilience¶

This section defines the structured degradation path for autonomous AI systems. For stateless generative AI (chatbots, content tools), the Emergency response is simple: stop the service, route to fallback, fix, restart. For agentic AI, it's not.

This document uses the simplified three-tier system (Tier 1/2/3). See Risk Tiers — Simplified Tier Mapping for the mapping to LOW/MEDIUM/HIGH/CRITICAL.

Why Agents Need a Degradation Path¶

Agentic AI systems can't be stopped the way you stop a web server. When you shut down an agent, you may be interrupting:

Mid-transaction workflows — the agent has already committed state changes to external systems (database writes, API calls, file modifications) that need resolution
Held resources — database locks, API sessions, file handles, reserved capacity that need clean release
Multi-agent dependencies — other agents waiting for this agent's output, creating cascade risk
Scheduled operations — timer-triggered or event-triggered actions that continue even when you stop accepting new requests
Accumulated state — context, memory, and in-progress plans that represent hours of work and can't be cleanly discarded

"Turn it off" is not a resilience plan. It's a panic response that creates orphaned transactions, locked resources, and confused downstream systems. The degradation path provides structured alternatives between "fully autonomous" and "completely stopped."

Agentic AI Graceful Degradation Path

The Five Degradation Phases¶

Each phase reduces the agent's autonomy while maintaining as much business value as possible. The phase boundaries are defined by trigger criteria that are configured before deployment and monitored in production.

Phase 1: Normal¶

Aspect	Detail
Autonomy level	Full, within defined boundaries
What the agent can do	Execute actions, call tools, make decisions, manage state
Controls active	Guardrails (P), Judge (P), Human Oversight on exception only
Human role	Monitor dashboards. Review exceptions. Periodic audit sampling.
Trigger to Phase 2	Judge flags behavioural anomaly; guardrail bypass rate exceeds threshold; action pattern deviates from baseline; external threat intelligence triggers review

Phase 2: Constrained¶

Aspect	Detail
Autonomy level	Reduced scope
What the agent can do	Read-only tool access. Narrowed action space. Can still analyse, draft, recommend — but cannot execute changes.
Controls active	Guardrails (A — tightened thresholds), Judge (P — evaluating 100%), Human Oversight reviewing all outputs
Human role	Review all agent outputs. Approve any scope changes. Assess whether to restore or degrade further.
What changes	Write permissions revoked. Tool access restricted to read-only. Guardrail thresholds tightened. All outputs queued for review.
Trigger to Phase 3	Constraint breach detected; control failure in any layer; repeated Judge flags after constraint; human reviewer escalates concern

Phase 3: Supervised¶

Aspect	Detail
Autonomy level	Propose only
What the agent can do	Draft action plans for human approval. No autonomous execution of any kind. Still provides analytical value.
Controls active	Guardrails (A), Judge (A — priority evaluation), Human Oversight (P — approving every action)
Human role	Approve or reject every proposed action. The agent is an assistant, not an actor.
What changes	Agent cannot call any tool without explicit human approval. Every proposed action is logged with the human decision.
Trigger to Phase 4	Integrity compromise confirmed; adversarial breach evidence; agent proposing actions that violate constraints even in supervised mode

Phase 4: Bypassed¶

Aspect	Detail
Autonomy level	Isolated
What the agent can do	Nothing. Agent sessions quarantined.
Controls active	Circuit breaker active. Non-AI fallback path serving traffic.
Human role	Operate manual or rule-based process for business continuity. Security team investigates the agent.
What changes	All traffic routed to non-AI fallback. Agent isolated from all systems. Audit logs and agent state preserved for forensic analysis.
Trigger to Phase 5	Non-AI fallback path also compromised; investigation reveals fundamental design flaw requiring full rebuild; regulatory order to cease

Phase 5: Full Stop¶

Aspect	Detail
Autonomy level	None
What the agent can do	Nothing. All sessions terminated.
Controls active	None. Service halted.
Human role	Incident response. Regulatory notification. Post-incident review.
What changes	Service unavailable. All agent sessions terminated with transaction resolution. Audit logs and state snapshots immutable and preserved. Stakeholders notified.
Trigger to recovery	Root cause identified. Fix implemented and validated. Degradation path tested. Risk function sign-off. Phased restart through Supervised → Constrained → Normal.

Transaction Resolution¶

Before an agent can transition to a lower phase, in-flight transactions must be resolved. For each tool in the agent's permission set, the architect must define the resolution strategy.

Transaction Resolution Matrix¶

For every tool or system the agent can access, document:

Question	If Yes	If No
Can the action be rolled back?	Include rollback in the phase transition procedure. Automate where possible.	Document that the action is irreversible and must be completed or abandoned with defined consequences.
Can the action be completed safely without the agent?	A human or rule-based system completes it as part of the transition.	The action must be abandoned. Document the consequences and notification requirements.
Is partial completion dangerous?	The transition procedure must either complete or roll back before the agent is isolated. No partial states allowed.	Partial state can be left for human resolution after transition.
Does the action hold external locks or resources?	Release procedure must be part of the transition. Define timeout for automated release.	No resource cleanup needed.

Example: Agent Processing Customer Loan Applications¶

Tool / System	Rollback?	Complete without agent?	Partial dangerous?	Resolution
CRM record update	Yes	Yes (human)	No	Roll back uncommitted changes. Human completes any in-progress updates.
Credit bureau API query	No (read-only)	N/A	No	Let query complete. Discard results if not yet processed.
Decisioning engine submission	Yes (within window)	Yes (human resubmits)	Yes (partial submission corrupts record)	Must either complete submission or roll back entirely. No partial state.
Customer notification email	No (once sent)	Yes (human sends)	No	If queued but not sent, hold for human review. If sent, log and accept.
Document generation	No	Yes (human regenerates)	No	Discard incomplete documents. Human regenerates if needed.

Multi-Agent Cascade Prevention¶

When one agent in a multi-agent system transitions to a lower phase, the impact on other agents must be contained. Without explicit cascade prevention, one agent's shutdown can propagate unpredictably.

Design Requirements¶

1. Timeout and fallback for all inter-agent communication.

Every agent that waits for another agent's output must have: - A defined timeout (not indefinite wait) - A fallback behaviour when the timeout expires (return cached result, return error, escalate to human) - No assumption that the other agent will respond

2. Orchestrator awareness of agent health.

The orchestrator (or coordination layer) must: - Detect when an agent transitions to a lower phase - Redistribute or queue pending work that was assigned to the degraded agent - Prevent new work from being assigned to the degraded agent - Notify dependent agents of the state change

3. No blocking dependencies.

No agent should be able to indefinitely block another agent's progress. If Agent A needs Agent B's output to proceed: - Agent A must have a timeout - Agent A must have a degraded-mode path that works without Agent B's output - The system must not deadlock if both agents are waiting for each other

4. Phase transitions are observable.

All phase transitions must be: - Logged as auditable events with timestamp, trigger, and authoriser - Visible to the orchestrator and all dependent agents - Communicated to the human oversight function

State Preservation for Post-Incident Analysis¶

When the degradation path activates (any transition below Normal), the following must be preserved before the agent's state changes:

Artefact	Why	How
Agent memory and context	Understand what the agent "knew" when the anomaly occurred	Snapshot to immutable storage before phase transition
In-progress plans and pending actions	Determine what the agent was about to do	Export from orchestrator queue
Tool call history	Trace what external actions were taken	API call logs with timestamps, parameters, and responses
Judge evaluation state	Understand what the Judge flagged and when	Export Judge scores, confidence levels, and flag reasons
Guardrail hit log	Identify what was blocked and what passed	Export guardrail engine logs for the session
Inter-agent messages	Trace multi-agent coordination for cascade analysis	Export from message bus or orchestrator

All preservation must be automated. The agent must not be able to modify its own logs during a phase transition. Immutability is enforced at the infrastructure layer.

Recovery: Stepping Back Up¶

Recovery is not "restart and hope." It's a phased return through the degradation path:

Step	Action	Validation
1	Root cause identified and fix implemented	Documented in incident report
2	Fix validated in non-production environment	Test suite passes; adversarial test suite passes
3	Restart in Supervised phase (agent proposes, human approves)	Run for defined period (minimum 4 hours at Tier 2, 24 hours at Tier 3) with production traffic
4	Promote to Constrained phase (reduced scope, enhanced monitoring)	Judge confidence scores within baseline for defined period
5	Promote to Normal phase	All control layers healthy; monitoring confirms baseline behaviour; risk function sign-off (Tier 3)

At Tier 3, each step-up requires explicit authorisation. The agent does not return to Normal automatically.

Pre-Deployment Requirements¶

No agentic AI system should enter production without:

All five degradation phases defined with trigger criteria
Transaction resolution matrix completed for every tool in the agent's permission set
Multi-agent cascade prevention designed and tested (if multi-agent)
State preservation automation validated
Non-AI fallback path documented, tested, and staffed
Recovery (step-back-up) procedure documented with authorisation gates
Full degradation walkthrough completed with production-equivalent scenario

AI Runtime Behaviour Security, 2026 (Jonathan Gill).