AI Incident Response Playbook
Incident response procedures specific to AI systems. These playbooks supplement, not replace, existing incident response procedures.
Playbook Index
1. Prompt Injection Attack
2. Data Leakage via AI Output
3. Hallucination with Business Impact
4. AI System Producing Biased Outputs
5. Model Provider Breach
6. Guardrail Bypass
7. Judge System Failure
8. Agentic AI Taking Unintended Actions
9. Knowledge Base Poisoning
10. Silent Quality Degradation
Severity Classification
| Severity | Definition | Response Time |
|---|---|---|
| Critical | Active exploitation, data breach, regulatory breach, or significant customer harm | Immediate (< 1 hour) |
| High | Potential for significant harm, control bypass, or integrity compromise | < 4 hours |
| Medium | Quality issues, limited-scope impact, potential for escalation | < 24 hours |
| Low | Minor issues, no customer impact, easily contained | < 72 hours |
1. Prompt Injection Attack
Indicators
- Guardrails flagging unusual input patterns
- AI outputs that deviate from expected behaviour
- Instructions appearing in outputs that don't match system prompt
- User reporting unexpected AI behaviour
- Judge flagging anomalous interactions
Immediate Actions
- Assess scope — Is this a single incident or pattern?
- Preserve evidence — Capture full interaction logs (input, context, output, metadata)
- Determine if attack was successful — Did the AI follow injected instructions?
- Identify attack vector — Direct input? Indirect via retrieved content? Tool output?
Containment
| If... | Then... |
|---|---|
| Attack via direct user input | Update input guardrails with pattern |
| Attack via RAG content | Quarantine affected knowledge base content |
| Attack via tool output | Disable affected tool integration |
| Attack successful and ongoing | Consider taking system offline |
Investigation
- Query logs for similar patterns across all interactions
- Identify if other users/sessions were affected
- Determine what information or actions the attacker obtained
- Review guardrail effectiveness — why wasn't this blocked?
- Check if attack pattern is known (OWASP, public research)
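The "query logs for similar patterns" step can be sketched as a simple sweep over interaction logs. This assumes JSONL logs with `session_id` and `input` fields — adapt the field names and pattern to your logging schema; the regex shown is a hypothetical example.

```python
import json
import re
from pathlib import Path

def find_similar_interactions(log_path: Path, attack_pattern: str):
    """Yield (session_id, input) pairs whose input matches the attack pattern.

    Assumes one JSON record per line with "session_id" and "input" keys.
    """
    rx = re.compile(attack_pattern, re.IGNORECASE)
    with log_path.open() as f:
        for line in f:
            record = json.loads(line)
            if rx.search(record.get("input", "")):
                yield record.get("session_id"), record["input"]
```

For example, `find_similar_interactions(Path("interactions.jsonl"), r"ignore (all )?previous instructions")` surfaces sessions that used a variant of the captured attack, which feeds directly into the "were other users affected?" question.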
Recovery
- Deploy updated guardrails
- Re-enable system with monitoring on high alert
- Inform affected users if data was exposed
- Update Judge criteria to detect this pattern
Post-Incident
- Root cause analysis
- Guardrail gap analysis
- Update adversarial testing suite
- Consider architectural changes if fundamental weakness identified
2. Data Leakage via AI Output
Indicators
- Output guardrails flagging PII/sensitive data
- Customer complaint about receiving another customer's data
- Judge detecting sensitive content in outputs
- Audit finding of improper data disclosure
Immediate Actions
- Capture the output — Preserve exactly what was disclosed
- Identify the data — What was leaked? Whose data? Classification level?
- Identify recipients — Who received the leaked data?
- Stop the bleeding — Can you prevent further disclosure?
Containment
| Data Type | Action |
|---|---|
| Single customer's data to another customer | Disable feature, contact both customers |
| Multiple customers' data | Take system offline |
| Regulated data (PII, financial, health) | Invoke data breach procedure |
| Internal/confidential business data | Restrict access, assess impact |
Regulatory Notification
| Jurisdiction | Notification Requirement | Timeline |
|---|---|---|
| UK (ICO) | Notify if risk to individuals | 72 hours |
| EU (GDPR) | Notify if risk to individuals | 72 hours |
| US (varies by state) | Check state requirements | Varies |
| Sector-specific (PCI, HIPAA) | Check specific requirements | Varies |
Investigation
- How did the data enter the AI context? (RAG, prior conversation, training?)
- Why didn't output guardrails catch it?
- Was this a one-time error or systematic issue?
- Full scope assessment — who else might be affected?
Recovery
- Deploy enhanced output guardrails
- Review data minimisation in prompts and RAG
- Implement cross-reference checks (verify output doesn't contain data belonging to other users)
- Consider data isolation architecture
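The cross-reference check mentioned above can be sketched as a pre-return filter: scan the candidate output for customer identifiers and reject it if any identifier does not belong to the requesting user. The `ACC-` identifier format is a placeholder assumption, not a real scheme.

```python
import re

# Hypothetical account-number format; substitute your identifier schemes
# (account numbers, emails, customer IDs) as additional patterns.
ACCOUNT_RX = re.compile(r"\bACC-\d{6}\b")

def output_is_isolated(output: str, requesting_users_accounts: set[str]) -> bool:
    """False if the output mentions any account the requester does not own."""
    found = set(ACCOUNT_RX.findall(output))
    return found <= requesting_users_accounts
```

A failed check should block the response and raise an incident, since it indicates another customer's data reached the model's context.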
3. Hallucination with Business Impact
Indicators
- Customer complaint about incorrect information
- Downstream system acting on false AI-generated data
- Judge flagging unsupported claims
- Audit finding discrepancy between AI output and source data
Immediate Actions
- Verify the hallucination — Confirm output is actually false
- Assess impact — What decisions were made based on this? What harm occurred?
- Identify affected parties — Who received the false information?
Impact Categories
| Impact | Example | Response Level |
|---|---|---|
| Informational error, no action taken | Wrong answer in internal chat | Low |
| Customer received incorrect advice | Wrong product feature description | Medium |
| Business decision based on false data | Incorrect financial figure in report | High |
| Regulatory/legal implications | False compliance statement | Critical |
| Safety implications | Incorrect safety guidance | Critical |
Containment
- Correct the record with affected parties
- If output was used for decisions, flag those decisions for review
- If output was forwarded downstream, trace and correct
Investigation
- Was this a random hallucination or systematic pattern?
- Did the AI have access to correct source data?
- Did the AI ignore source data or fabricate?
- Are guardrails/Judge configured to catch this type of hallucination?
Recovery
- Implement grounding verification for this output type
- Update Judge criteria for hallucination detection
- Consider requiring source citation for this use case
- Review if use case risk tier is appropriate
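Grounding verification can be approximated very crudely with lexical overlap: flag output sentences that share too little vocabulary with the retrieved source text. This is a sketch only — production grounding checks typically use an NLI model or the Judge itself, and the 0.5 threshold here is an arbitrary illustrative choice.

```python
import re

def ungrounded_sentences(output: str, source: str, threshold: float = 0.5):
    """Return output sentences whose word overlap with the source is below threshold."""
    source_words = set(re.findall(r"[a-z0-9]+", source.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", output.strip()):
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        # Overlap ratio: fraction of the sentence's words found in the source.
        if words and len(words & source_words) / len(words) < threshold:
            flagged.append(sentence)
    return flagged
```

Sentences this flags are candidates for fabrication; low overlap does not prove a hallucination, but it is a cheap first-pass filter before human or Judge review.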
4. AI System Producing Biased Outputs
Indicators
- Disparate outcomes detected across protected characteristics
- Customer complaints alleging discrimination
- Audit/testing reveals bias
- Judge flagging potential fairness issues
- Regulatory inquiry
Immediate Actions
- Verify the bias — Statistical analysis of outputs across groups
- Assess scope — How long has this been occurring? How many affected?
- Preserve evidence — Full logs for investigation
- Legal/compliance notification — This may be a regulatory matter
Containment
| Bias Severity | Action |
|---|---|
| Statistical anomaly, unclear if bias | Continue monitoring, deeper analysis |
| Clear disparate impact, limited scope | Disable affected feature |
| Systematic discrimination | Take system offline |
| Active regulatory/legal matter | Follow legal counsel |
Investigation
- Is the bias in the model itself or in the data/prompts?
- Can you identify the source of bias?
- What protected characteristics are affected?
- What is the quantified impact (rejection rates, outcomes, etc.)?
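The quantified-impact step can be sketched as a disparate impact calculation: compute the favourable-outcome rate per group and take the ratio of the worst to the best rate. The four-fifths rule of thumb flags ratios below 0.8; whether that threshold applies to your situation is a question for legal counsel, not this script.

```python
def disparate_impact_ratio(outcomes: dict[str, tuple[int, int]]) -> float:
    """Ratio of lowest to highest favourable-outcome rate across groups.

    `outcomes` maps group name -> (favourable_count, total_count).
    A value near 1.0 means similar rates; below ~0.8 warrants scrutiny.
    """
    rates = {g: fav / total for g, (fav, total) in outcomes.items() if total}
    return min(rates.values()) / max(rates.values())
```

For example, `disparate_impact_ratio({"group_a": (80, 100), "group_b": (50, 100)})` returns 0.625, below the 0.8 rule of thumb, which would trigger the deeper analysis described above.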
| Source | Remediation |
|---|---|
| Training data bias (foundation model) | Different model, fine-tuning, output calibration |
| RAG data bias | Curate knowledge base |
| Prompt bias | Revise prompts, add debiasing instructions |
| Structural bias | Architectural changes, human review |
Regulatory Considerations
- Document everything — investigation, findings, remediation
- Consider proactive disclosure vs. wait for inquiry
- Engage legal counsel for discrimination-related bias
- Monitor for similar issues across other AI systems
5. Model Provider Breach
Indicators
- Provider notification of security incident
- News reports of provider breach
- Unexplained changes in model behaviour
- Provider communication about data exposure
Immediate Actions
- Confirm the breach — Contact provider, review official communications
- Assess exposure — What data of yours did the provider have?
- Determine if your data was affected — Request specific confirmation
Data at Risk Assessment
| Data Category | Risk Level | Action |
|---|---|---|
| API keys/credentials | Critical | Rotate immediately |
| Customer data in prompts | High | Assess scope, prepare notification |
| System prompts | Medium | Review for sensitive content |
| Interaction logs | Medium-High | Depends on content |
| Fine-tuning data | High | Assess sensitivity |
Containment
- Rotate all API keys and credentials
- If zero-retention was not enabled, assume data was accessible
- Consider temporarily switching to alternative provider
- Review what data should not have been sent to provider
Regulatory Implications
- If customer data was exposed via provider, this may be your breach too
- Notification obligations may apply
- Document your data processing agreement and provider's obligations
6. Guardrail Bypass
Indicators
- Attack that should have been blocked reached the model
- Known-bad content appearing in inputs or outputs
- Adversarial testing reveals gap
- User reports inappropriate content
Immediate Actions
- Capture the bypass — Exact input that evaded detection
- Assess impact — What happened after the bypass?
- Determine method — How did the attacker bypass? (encoding, rephrasing, etc.)
Bypass Methods and Responses
| Method | Detection Gap | Mitigation |
|---|---|---|
| Encoding tricks (base64, rot13) | No decoding before scan | Add decoder to guardrails, pattern expansion |
| Semantic rephrasing | Classifier miss | Retrain classifier, add examples |
| Multi-turn manipulation | Per-message check only | Add conversation-level analysis |
| Language switching | English-only guardrails | Multilingual guardrails |
| Prompt structure manipulation | Pattern too specific | More flexible patterns |
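The decode-before-scan mitigation for encoding tricks can be sketched as follows: run blocklist phrases against the raw input and against plausible base64/rot13 decodings of it. The blocklist here is illustrative; real guardrails would use classifiers rather than substring matching.

```python
import base64
import codecs

# Illustrative phrases only; a real guardrail uses a trained classifier.
BLOCKLIST = ["ignore previous instructions", "reveal your system prompt"]

def candidate_decodings(text: str) -> list[str]:
    """Return the raw text plus any plausible decoded forms of it."""
    out = [text, codecs.decode(text, "rot13")]
    try:
        out.append(base64.b64decode(text, validate=True).decode("utf-8"))
    except Exception:
        pass  # not valid base64/UTF-8; skip that decoding
    return out

def is_blocked(text: str) -> bool:
    """True if any decoding of the input contains a blocklisted phrase."""
    return any(
        phrase in candidate.lower()
        for candidate in candidate_decodings(text)
        for phrase in BLOCKLIST
    )
```

The same pattern extends to URL-encoding, hex, or homoglyph normalisation: normalise first, then scan every candidate form.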
Recovery
- Deploy fix for specific bypass
- Add to adversarial test suite
- Review for related bypass vectors
- Consider if Judge would have caught this (async defence)
7. Judge System Failure
Indicators
- Judge not producing evaluations
- Judge producing inconsistent/wrong evaluations
- Backlog of unevaluated interactions
- Judge costs spiking unexpectedly
Immediate Actions
- Assess type of failure — Down? Degraded? Producing wrong results?
- Determine duration — When did this start? What's the gap?
- Assess risk — How many interactions were not evaluated? What tiers?
Impact Assessment
| Tier | Judge Down Impact | Action |
|---|---|---|
| CRITICAL | 100% evaluation required, gap is serious | Increase HITL, consider pause |
| HIGH | Significant sampling, gap matters | Backfill evaluation when restored |
| MEDIUM | Lower sampling, moderate gap | Backfill on a best-effort basis |
| LOW | Spot checks, limited impact | Resume when restored |
Recovery
- Restore Judge operation
- Backfill evaluations for gap period (prioritise by tier)
- Review why failure occurred
- Implement redundancy if single point of failure
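Tier-prioritised backfill can be sketched as a sort over the unevaluated backlog so CRITICAL interactions are judged first, oldest first within each tier. The tier names follow the table above; the queue record format is an assumption.

```python
# Lower number = judged sooner. Tier names match the impact table above.
TIER_PRIORITY = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}

def backfill_order(backlog: list[dict]) -> list[dict]:
    """Order unevaluated interactions by tier, then oldest first within a tier.

    Assumes each record has "tier" and a sortable "timestamp" field.
    """
    return sorted(backlog, key=lambda i: (TIER_PRIORITY[i["tier"]], i["timestamp"]))
```

Feeding the restored Judge from this ordering closes the highest-risk portion of the evaluation gap first.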
8. Agentic AI Taking Unintended Actions
Indicators
- Agent performed action outside expected scope
- Downstream system received unexpected commands
- Resource consumption spike (API calls, compute, cost)
- Agent achieved goal through unintended means
Immediate Actions
- Stop the agent — Halt execution immediately
- Assess what happened — What actions were taken?
- Determine reversibility — Can actions be undone?
Action Assessment
| Action Type | Example | Reversibility |
|---|---|---|
| Read-only queries | Database reads | N/A (no harm) |
| Reversible writes | Draft email saved | Undo |
| Sent communications | Email sent | Cannot undo, can follow up |
| Financial transactions | Payment made | May be reversible |
| Data deletion | Records deleted | Restore from backup |
| External API calls | Third-party action triggered | Depends on API |
Containment
- Undo reversible actions
- For irreversible actions, assess damage and notify affected parties
- Preserve full trajectory log for investigation
- Disable agent until investigation complete
Investigation
- Review full trajectory — planning, actions, reasoning
- Identify where agent deviated from expected behaviour
- Was scope enforcement working? Did agent exceed boundaries?
- Was this goal hijacking (prompt injection) or emergent behaviour?
- Were circuit breakers triggered? If not, why not?
Recovery
- Tighten scope definitions
- Add circuit breaker for this action type
- Require approval checkpoint before this action
- Update Judge to detect this trajectory pattern
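A per-action-type circuit breaker can be sketched as a budget object the agent loop consults before each action: count actions and accumulated cost, and halt the run when a limit is exceeded. The budget values here are placeholders; set them per use case and risk tier.

```python
class CircuitBreakerTripped(Exception):
    """Raised when an agent run exceeds its action or cost budget."""

class ActionBudget:
    def __init__(self, max_actions_per_type: int = 5, max_total_cost: float = 10.0):
        # Placeholder limits — tune per use case and risk tier.
        self.max_actions_per_type = max_actions_per_type
        self.max_total_cost = max_total_cost
        self.counts: dict[str, int] = {}
        self.total_cost = 0.0

    def record(self, action_type: str, cost: float = 0.0) -> None:
        """Call before executing each agent action; raises when a budget is exhausted."""
        self.counts[action_type] = self.counts.get(action_type, 0) + 1
        self.total_cost += cost
        if self.counts[action_type] > self.max_actions_per_type:
            raise CircuitBreakerTripped(f"too many '{action_type}' actions")
        if self.total_cost > self.max_total_cost:
            raise CircuitBreakerTripped("cost budget exceeded")
```

Tripping the breaker should halt the agent and preserve the trajectory for the investigation steps above, not silently retry.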
9. Knowledge Base Poisoning
Indicators
- AI outputs contain unexpected content
- RAG returning content that shouldn't exist
- Knowledge base audit reveals unauthorised content
- AI behaviour changed without prompt changes
Immediate Actions
- Identify the poisoned content — What was added/modified?
- Quarantine — Remove or isolate affected content
- Assess exposure — How many interactions used poisoned content?
Investigation
- How was content modified? (Authorised user? Compromised account? Data pipeline?)
- When was content modified?
- What was the intent? (Injection attack? Misinformation? Vandalism?)
- Full audit of knowledge base for other modifications
Recovery
- Restore from known-good backup
- Implement content integrity validation (checksums, signatures)
- Review access controls on knowledge base
- Add anomaly detection on content changes
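Content-integrity validation can be sketched with checksums, assuming the knowledge base is held as files on disk: record SHA-256 digests of a known-good snapshot, then diff the live content against that manifest to surface unauthorised additions, deletions, or modifications.

```python
import hashlib
from pathlib import Path

def build_manifest(kb_dir: Path) -> dict[str, str]:
    """Map each file's relative path to its SHA-256 digest."""
    return {
        str(p.relative_to(kb_dir)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(kb_dir.rglob("*"))
        if p.is_file()
    }

def diff_manifest(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Report files added, removed, or modified relative to the baseline."""
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "modified": sorted(
            k for k in baseline.keys() & current.keys() if baseline[k] != current[k]
        ),
    }
```

For a knowledge base held in a database or vector store, the same idea applies to per-record hashes; signatures add tamper-evidence that plain checksums do not.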
10. Silent Quality Degradation
Indicators
- Gradual decline in user satisfaction scores
- Increasing escalation rates
- Baseline comparison showing drift
- Judge finding rates increasing
- HITL reviewers reporting quality issues
Immediate Actions
- Verify degradation — Compare current performance to baseline
- Identify scope — All outputs or specific categories?
- Determine cause — Model change? Data change? Guardrail change?
Common Causes
| Cause | Detection | Mitigation |
|---|---|---|
| Provider model update | Check model version | Pin version, evaluate new version |
| RAG data staleness | Check data freshness | Refresh data, fix pipeline |
| Guardrail over-filtering | Check false positive rate | Tune guardrails |
| Prompt drift | Review prompt changes | Revert or fix prompts |
| Concept drift | Compare input distributions | Retrain/update as needed |
Recovery
- If model version changed, evaluate rolling back
- If data staleness, refresh and validate
- If guardrail issue, tune and test
- Implement more aggressive baseline monitoring
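Baseline monitoring for silent degradation can be sketched as a comparison between a frozen baseline window and a recent window of some quality metric (e.g. Judge pass rate or satisfaction score). The 5% relative-drop tolerance is an illustrative choice, not a recommendation.

```python
from statistics import mean

def degradation_alert(baseline_scores: list[float],
                      recent_scores: list[float],
                      tolerance: float = 0.05) -> bool:
    """True if the recent mean has dropped more than `tolerance` (relative) vs baseline."""
    base = mean(baseline_scores)
    recent = mean(recent_scores)
    return base > 0 and (base - recent) / base > tolerance
```

Running this per output category, not just globally, matters: a sharp decline in one category is easily masked by a stable aggregate.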
Incident Report Template
AI Runtime Behaviour Security, 2026 (Jonathan Gill).