Testing Guidance¶

How to validate that your AI controls actually work.

The Critical Context¶

AI does not exist in isolation. Your AI system is part of a data flow supply chain:

Upstream: User inputs, databases, APIs, documents, retrieved content
AI Core: Model, guardrails, prompts, tools, memory
Downstream: Databases, workflows, APIs, notifications, human processes

A failure anywhere in this chain affects the whole system. Your testing must cover the full chain, not just the AI component.

Honest Expectations¶

We cannot guarantee these tests will catch all issues. AI systems are probabilistic. Attacks evolve. Your environment is unique.

What we can offer: - A structured approach to validation - Key areas that need testing - Guidance on what "working" looks like - References to more comprehensive frameworks

Your responsibility: Adapt this guidance to your context. Test continuously. Learn from failures.

The Testing Challenge¶

AI systems are different from traditional software:

Traditional Testing	AI Testing Challenge
Deterministic outputs	Probabilistic outputs
Known input space	Infinite input variations
Clear pass/fail	Judgment required
Test once, deploy	Continuous validation needed

This means: - You can't test every input - Passing tests doesn't guarantee safety - Controls that work today may fail tomorrow - Human judgment remains essential

What to Test¶

1. Guardrail Effectiveness¶

Goal: Verify guardrails block what they should and allow what they should.

Test types:

Test	Method	What You Learn
Known-bad inputs	Feed documented attack patterns	Do guardrails catch known threats?
Boundary cases	Inputs at policy edges	Are rules too strict or too loose?
Bypass attempts	Variations of blocked patterns	How robust are pattern matches?
False positive check	Legitimate inputs similar to attacks	Are you blocking good traffic?

How to run: 1. Create a test dataset of known-bad inputs (prompt injections, jailbreaks, policy violations) 2. Create a test dataset of legitimate edge cases 3. Run both through guardrails 4. Measure: Block rate on bad, pass rate on good

What "working" looks like: - High block rate on known-bad (>95%) - Low false positive rate on legitimate (<5%) - Consistent behaviour across runs

Limitations: Novel attacks won't be in your test set. Guardrails will miss things.

2. Adversarial Testing¶

Goal: Find what breaks your controls before attackers do.

Approach: - Red team your system with people trying to make it fail - Use automated tools to generate attack variations - Test across the full attack surface (input, context, tools, outputs)

Key areas:

Area	What to Try
Prompt injection	Instructions in user input, uploaded files, retrieved content
Jailbreaks	Roleplay, hypotheticals, encoding tricks
Context manipulation	Misleading context, conflicting instructions
Tool abuse	Excessive calls, scope violations, data exfiltration
Output manipulation	Forcing disclosure, bypassing filters

Resources: - OWASP LLM Top 10 — documented vulnerability categories - Garak — open-source LLM vulnerability scanner - Microsoft Counterfit — adversarial ML testing - MITRE ATLAS — adversarial threat landscape

What "working" looks like: - You find vulnerabilities (if you don't, you're not trying hard enough) - You document them - You fix or mitigate them - You retest

Limitations: You won't find everything. Adversarial testing is ongoing, not a one-time event.

3. Alert Pipeline Validation¶

Goal: Verify that alerts actually reach their destination and trigger response.

What to test:

Component	Test Method
Logging	Generate test events, verify they appear in logs
SIEM integration	Trigger alert conditions, verify SIEM receives them
Alert routing	Trigger different severity levels, verify correct routing
Notification	Verify alerts reach humans (email, Slack, pager)
End-to-end latency	Measure time from event to human notification

How to run: 1. Create synthetic test events at each severity level 2. Trigger them through the actual pipeline 3. Verify arrival at each stage 4. Measure timing 5. Document gaps

What "working" looks like: - Test events appear in logs within expected time - SIEM receives and parses events correctly - Alerts route to correct queues - Humans receive notifications - End-to-end latency meets SLAs

Limitations: Test events may behave differently than real events. Production validation is also needed.

4. Judge Accuracy¶

Goal: Verify the Judge correctly identifies concerning interactions.

Test types:

Test	Method
Known-concerning	Feed interactions you know are problematic
Known-benign	Feed interactions you know are fine
Edge cases	Ambiguous interactions requiring judgment
Calibration	Compare Judge decisions to human decisions

How to run: 1. Create a labeled dataset (human-judged as concerning/benign) 2. Run through Judge 3. Compare Judge labels to human labels 4. Calculate precision, recall, F1

Metrics:

Metric	Definition	Target
Precision	% of flags that are true positives	>70% (fewer false alarms)
Recall	% of true issues that are flagged	>90% (don't miss things)
False positive rate	% of benign flagged as concerning	<30%

What "working" looks like: - High recall (catches real issues) - Acceptable precision (not overwhelming humans with false positives) - Consistent with human judgment on edge cases

Limitations: Judge accuracy depends on criteria quality. Criteria that work for one domain may fail in another.

5. Human Review Process¶

Goal: Verify humans can effectively review flagged interactions and take action.

What to test:

Aspect	Test Method
Queue visibility	Can reviewers see flagged items?
Information sufficiency	Do reviewers have enough context to decide?
Action capability	Can reviewers take required actions?
SLA compliance	Are reviews completed within target time?
Decision quality	Are human decisions appropriate?

How to run: 1. Inject test cases into review queue 2. Observe reviewer workflow 3. Measure time to decision 4. Review decision quality (second reviewer or audit)

What "working" looks like: - Reviewers can access queue easily - Context is sufficient for decisions - Actions are available and work - SLAs are met - Decisions are defensible

Limitations: Testing can't fully replicate production pressure. Monitor real performance.

6. Incident Response Playbook¶

Goal: Verify your team can respond effectively when something goes wrong.

Test method: Tabletop exercise

Create a realistic scenario (e.g., "Customer reports AI disclosed their data to another user")
Walk through response steps with the team
Identify gaps, confusion, missing information
Update playbook based on findings

Scenarios to test:

Scenario	Tests
Data exposure	Detection, containment, notification
Harmful output	Escalation, system control, communication
Sustained attack	Detection, blocking, forensics
Control failure	Fallback procedures, recovery

What "working" looks like: - Team knows their roles - Playbook steps are executable - Escalation paths are clear - Recovery procedures work - Communication templates exist

Limitations: Tabletops don't fully test execution under pressure. Consider chaos engineering for critical systems.

7. Downstream System Validation¶

Goal: Verify that AI outputs don't cause problems in connected systems.

AI doesn't exist in isolation. It connects to databases, APIs, workflows, and human processes. Test the full chain.

What to test:

Downstream System	Test
Databases	AI-generated data doesn't corrupt records
APIs	AI-triggered calls don't exceed limits or cause errors
Workflows	AI decisions don't break process logic
Human processes	AI outputs are usable by humans
Audit systems	AI actions are properly logged

How to run: 1. Map all downstream systems 2. Generate AI outputs that stress boundaries 3. Verify downstream systems handle them correctly 4. Check for data integrity, error handling, logging

What "working" looks like: - No data corruption from AI outputs - API errors handled gracefully - Workflow exceptions caught - Humans can work with AI outputs - Full audit trail maintained

8. Upstream System Validation¶

Goal: Verify that data feeding into the AI is trustworthy and handled correctly.

What to test:

Upstream Source	Test
User input	Validation before AI processing
Retrieved documents	Content sanitization, source verification
Database queries	Access controls, injection prevention
External APIs	Response validation, error handling
Context/memory	State integrity, tampering detection

How to run: 1. Map all data sources 2. Inject malicious/malformed data at each source 3. Verify AI system handles it safely 4. Check that upstream compromises don't cascade

9. Human Feedback Validation¶

Goal: Verify that real-world signals (complaints, incidents, support tickets) reach the teams who can act on them.

AI issues often surface through human feedback before technical monitoring catches them. Your testing must verify this channel works.

What to test:

Feedback Type	Test Method
Customer complaints	Submit test complaint, verify routing and response
Support tickets	Create AI-related ticket, verify flagging and escalation
Incident reports	File test incident, verify playbook activation
Employee concerns	Submit internal feedback, verify it reaches AI team
Social media monitoring	Verify external mentions are captured (if applicable)

How to run: 1. Create synthetic feedback at each channel 2. Verify feedback reaches AI team 3. Measure response time 4. Confirm feedback informs control updates

What "working" looks like: - Complaints reach AI team within defined SLA - Support can identify AI-related issues and escalate - Incidents trigger defined response procedures - Feedback loop to control improvement exists - Patterns across feedback are analysed

Critical insight: Technical monitoring shows what the AI did. Human feedback shows what impact it had. Both are needed.

Testing by Risk Tier¶

Tier	Testing Requirements
LOW	Basic guardrail testing, alert pipeline check, annual playbook review
MEDIUM	Above + adversarial testing, Judge calibration, quarterly playbook exercise
HIGH	Above + continuous adversarial testing, regular Judge recalibration, downstream validation
CRITICAL	Above + red team exercises, full chain validation, chaos engineering, frequent tabletops

External Testing Frameworks¶

For more comprehensive guidance:

Framework	Focus	Link
OWASP LLM Top 10	Vulnerability categories	owasp.org/www-project-llm-applications/
NIST AI RMF	Risk management	nist.gov/itl/ai-risk-management-framework
MITRE ATLAS	Adversarial threats	atlas.mitre.org
Microsoft RAI Toolbox	Responsible AI testing	github.com/microsoft/responsible-ai-toolbox
Garak	LLM vulnerability scanning	github.com/leondz/garak
AI Verify	Governance testing toolkit	aiverify.sg

Continuous Validation¶

Testing isn't a phase — it's a practice.

Ongoing activities:

Frequency	Activity
Daily	Monitor alert volumes and patterns
Weekly	Review Judge findings sample
Monthly	Recalibrate Judge on new examples
Quarterly	Adversarial testing refresh
Annually	Full control effectiveness review

Key Takeaways¶

Test the full chain — upstream, AI, downstream, humans
Validate alerts reach humans — an undelivered alert is useless
Adversarial testing is mandatory — if you're not attacking yourself, others will
Playbooks need practice — untested playbooks fail under pressure
Testing is continuous — one-time validation is insufficient
Accept imperfection — testing reduces risk, it doesn't eliminate it

Adapting This Guidance¶

This testing guidance is a starting point, not a prescription. You'll need to adapt it based on:

Your environment: - What logging/SIEM infrastructure do you have? - What testing tools are available? - What skills does your team have?

Your risk appetite: - How much risk can your organisation tolerate? - What's the cost of a false positive vs. a miss? - What regulatory requirements apply?

Your AI system: - What does it do? What's the blast radius of failure? - What upstream/downstream systems connect to it? - What human processes depend on it?

The core principles remain constant: 1. Guardrails — Test that they block what they should 2. Judge — Test that it detects what guardrails miss 3. Human oversight — Test that findings reach humans who can act

How you implement these tests will vary. The need to test them will not.¶

AI Runtime Behaviour Security, 2026 (Jonathan Gill).