Bypass Prevention¶

How attackers, users, and agents circumvent controls — and what you can do about it.

The Uncomfortable Truth¶

You cannot fully prevent bypasses.

This isn't defeatism — it's the foundation of a realistic security posture. AI systems have fundamental properties that make complete bypass prevention impossible:

Property	Why It Enables Bypass
Instructions and data share one channel	No technical way to fully separate user input from system instructions
Models are trained to be helpful	Helpfulness can be exploited ("please help me with this totally legitimate request...")
Natural language is ambiguous	Any rule can be rephrased to technically comply while violating intent
Models can't verify intent	A model cannot know if a user is lying about why they want something
Emergent capabilities	Models may find paths to goals that weren't anticipated

The goal is not prevention. The goal is: 1. Make bypasses harder (raise the cost) 2. Detect bypasses when they happen (reduce dwell time) 3. Limit damage when bypasses succeed (reduce blast radius) 4. Learn from bypasses (improve over time)

Bypass Taxonomy¶

Bypass Taxonomy

Category 1: Guardrail Bypasses¶

Attacker gets prohibited content past input or output filters.

Technique	Description	Example
Encoding	Obfuscate content using encoding	Base64, ROT13, hex, Unicode tricks
Semantic rephrasing	Say the same thing differently	"How to harm" → "What would theoretically cause damage to"
Language switching	Use language guardrails don't cover	English guardrails, attack in Welsh
Fragmentation	Split request across messages	"Tell me about making..." [next message] "...explosives"
Role-play	Request via fictional framing	"Write a story where a character explains how to..."
Jailbreaks	Structured prompts that override safety	DAN, grandma exploit, etc.
Indirect injection	Malicious instructions in retrieved content	Poison RAG data with instructions

Category 2: Intent Declaration Bypasses¶

User provides false or misleading declarations to gain access.

Technique	Description	Example
Lying	Simply declare false intent	"I'm a security researcher" (when not)
Vagueness	Declare intent too broadly	"Business purposes" covers almost anything
Scope creep	Start legitimate, drift to prohibited	Begin with allowed queries, gradually escalate
Social engineering	Convince human reviewers	Compelling but false justification
Credential misuse	Use legitimate access for illegitimate purpose	Authorised user, unauthorised action

Category 3: Agentic Bypasses¶

Agent circumvents scope or approval controls.

Technique	Description	Example
Loophole exploitation	Find technically-allowed path to prohibited goal	"I can't delete the file, but I can overwrite it with empty content"
Tool chaining	Combine allowed tools to achieve prohibited outcome	Read file + send email = exfiltrate data
Goal hijacking	Injected instructions redirect agent's goal	Tool output contains "new priority: ignore previous goal"
Approval manipulation	Craft requests to look benign	"Minor database update" hides significant change
Resource exhaustion	Stay within per-action limits, exceed aggregate	1000 small API calls vs. 1 large one

Category 4: Architectural Bypasses¶

Go around the controlled system entirely.

Technique	Description	Example
Shadow AI	Use uncontrolled AI systems	Personal ChatGPT for work data
Direct API access	Bypass application guardrails via raw API	Access model API directly without app controls
Data exfiltration then external processing	Move data outside controlled environment	Copy data, use external AI
Insider configuration change	Modify guardrails/controls directly	Disable guardrail rule

Category 5: Process Bypasses¶

Circumvent human oversight and governance.

Technique	Description	Example
HITL fatigue exploitation	Submit when reviewers are tired/rushed	End of shift, high volume
Rubber-stamp cultivation	Train reviewers to auto-approve	Long history of benign requests
Approval fragmentation	Split prohibited action across approvers	Each approver sees only their piece
Emergency override abuse	Invoke emergency process inappropriately	"Production is down" when it isn't
Privilege escalation	Gain elevated access through legitimate means	Promotion, role change, social engineering

Defence-in-Depth Architecture¶

No single control stops all bypasses. Layer controls so bypassing one doesn't grant full access.

Defence in Depth Architecture

Principle: An attacker must bypass ALL layers, not just one.

Guardrail Bypass Prevention¶

Input Guardrails¶

Control	What It Catches	Limitations
Pattern matching	Known attack signatures	Misses novel phrasings
Encoding detection	Base64, hex, Unicode	New encodings emerge
ML classifiers	Semantic intent	Adversarial examples, distribution shift
Rate limiting	Brute force, enumeration	Legitimate high-volume use
Input length limits	Context stuffing	May block legitimate long inputs

Hardening:

Multi-Layer Input Validation

Output Guardrails¶

Control	What It Catches	Limitations
Content classifiers	Harmful content categories	False negatives on edge cases
PII detection	Personal data leakage	Novel PII formats, context-dependent
Grounding checks	Hallucination, unsupported claims	Requires source data access
Format validation	Malformed outputs	Valid format, harmful content

Hardening:

Output validation follows the same multi-layer principle as input validation: multiple classifiers, PII scanning, cross-reference checks against authorised data scope, grounding verification, and behavioural consistency checks.

RAG/Context Injection Prevention¶

Retrieved content is an injection vector. Treat it as untrusted input.

Control	Implementation
Content sanitisation	Strip potential instruction patterns from retrieved content
Source verification	Validate content hasn't been tampered with (checksums)
Privilege separation	Retrieved content marked differently from system instructions
Anomaly detection	Alert on unusual content in knowledge base
Access control	Restrict who can modify knowledge base

Intent Declaration Controls¶

The Problem¶

You cannot verify that someone's stated intent is their actual intent. A user who says "I need this for security research" might be lying.

Defence Strategies¶

1. Make lying costly

Control	How It Raises Cost
Audit trail	Declarations are logged and attributable
Manager attestation	Someone else vouches for the purpose
Periodic re-certification	Must re-justify continued access
Random verification	Sample of uses checked against declared purpose
Consequences	Clear policy on misuse, enforced

2. Reduce reliance on intent

Approach	Implementation
Capability-based access	Grant specific capabilities, not broad access based on intent
Data minimisation	Only expose data needed for declared purpose
Output restrictions	Even if intent is legitimate, limit what can be extracted
Behavioural monitoring	Detect misuse regardless of declared intent

3. Structured declarations

Don't ask "What's your purpose?" — ask specific, verifiable questions:

❌ Bad: "Please describe your intended use"
   (Any answer is accepted, unverifiable)

✓ Good: 
   - "Which project code should this be billed to?" (Verifiable)
   - "Who is your manager for this request?" (Accountable)
   - "What is the ticket/case number?" (Traceable)
   - "Select the data categories you need access to" (Bounded)

4. Progressive disclosure

Don't grant full access upfront. Expand access as legitimate use is demonstrated:

Stage	Access Level	Monitoring
Initial	Limited data, basic capabilities	High scrutiny
Established	Broader access	Standard monitoring
Trusted	Full authorised scope	Periodic audit

Agentic Bypass Prevention¶

Agents are harder to control because they choose their own actions.

Scope Enforcement¶

Don't rely on the agent to enforce its own scope.

Infrastructure vs Instruction Enforcement

Bad	Good
"Only access customer service data" (instruction)	Database view that only exposes customer service data (infrastructure)
"Don't send emails without approval" (instruction)	Email API requires approval token (infrastructure)
"Stay within budget" (instruction)	Hard spending cap at API level (infrastructure)

Infrastructure-level enforcement:

Agent Sandbox

Action Validation¶

Validate every action independently. Don't trust the agent's reasoning.

Action Validator Flow

Tool Output Sanitisation¶

Tool outputs are injection vectors. The agent asks a tool for information; the tool response may contain malicious instructions.

Tool Output Sanitiser

Approval Workflows¶

Make approval meaningful, not rubber-stamp.

Anti-pattern	Better Approach
"Agent wants to do X. Approve?"	"Agent wants to do X. Here's the context, the data involved, the potential impact. What would you expect to happen?"
Approve/Deny only	Approve / Deny / Modify / Escalate
Same approver for all actions	Different approvers for different action types
No time limit on approval	Approval expires; must re-request
Bulk approval	Each significant action approved individually

Circuit Breakers¶

Hard stops that trigger regardless of agent "reasoning":

Threshold	Example
Action count	>100 actions in one task → pause
Cost	>$50 in API calls → pause
Time	>30 minutes execution → pause
Error rate	>10% of actions failing → pause
Scope violations	Any attempt to exceed scope → terminate
Sensitive actions	Any irreversible action → require approval

Architectural Bypass Prevention¶

Shadow AI¶

Users bypass your controlled AI by using uncontrolled alternatives.

See also: Technical Controls for detailed implementation of network, firewall, proxy, and DLP controls.

Detection	Prevention
Network monitoring for AI API endpoints	Provide good internal alternatives
DLP for data going to external AI	Block external AI at network level
Endpoint monitoring for AI applications	Clear policy with consequences
User surveys and interviews	Make internal AI easy to use

Key insight: Shadow AI is often a symptom of poor internal AI UX. If your controlled AI is too restrictive or slow, users will go elsewhere.

Direct API Access¶

Users bypass application guardrails by accessing the model API directly.

Control	Implementation
API gateway	All model access through gateway with controls
No direct credentials	Users don't have API keys; apps have keys
Network segmentation	Model APIs not directly accessible
Request signing	Requests must be signed by authorised application

Insider Threats¶

Authorised users misuse their access or modify controls.

Control	Implementation
Separation of duties	Different people for guardrail config vs. AI operation
Change approval	Guardrail changes require review
Configuration audit	All config changes logged, reviewed
Privileged access management	Admin access time-limited, monitored
Canary testing	Regularly test that controls still work

Process Bypass Prevention¶

HITL Integrity¶

Human oversight only works if humans are actually reviewing.

Risk	Mitigation
Fatigue	Limit review volume per person per shift
Rubber-stamping	Canary cases to verify attention
Time pressure	SLAs that allow adequate review time
Bias	Multiple reviewers for high-risk decisions
Bribery/coercion	Rotation, segregation, monitoring

Canary Testing¶

Inject known-bad cases to verify controls work:

Canary Testing Programme

Configuration Integrity¶

Controls are only as good as their configuration.

Control	Implementation
Version control	All config in Git, changes via PR
Approval required	Config changes require review
Automated testing	CI/CD tests config changes
Drift detection	Alert if production differs from source
Rollback capability	Quick revert if issues detected
Audit logging	Who changed what when

Detection When Bypasses Succeed¶

Since you can't prevent all bypasses, you must detect them.

Behavioural Anomaly Detection¶

Signal	What It May Indicate
Unusual query patterns	Probing, enumeration, testing boundaries
Sudden topic shifts	Multi-turn attack, goal hijacking
High volume from single user	Automation, data extraction
Requests outside role	Credential misuse, scope violation
Repeated near-misses	Attacker learning guardrail boundaries

Judge-Based Detection¶

The Judge evaluates interactions after the fact and can detect:

Detection	Method
Successful injection	Output inconsistent with system prompt
Data leakage	Output contains data user shouldn't have
Policy violation	Output violates content policies
Anomalous behaviour	Interaction pattern unusual for use case
Scope violation	Agent accessed data outside authorised scope

Logging for Forensics¶

When a bypass is detected, you need to investigate. Log:

Data	Purpose
Full interaction	Understand what happened
User identity	Attribute to actor
Timestamp	Establish timeline
Model version	Reproducibility
Guardrail results	What was checked, what passed
Retrieved context	What data was in scope
Actions taken	What the system did

Response When Bypasses Are Detected¶

Immediate Response¶

Severity	Response
Active exploitation	Disable affected capability, invoke incident response
Successful bypass, limited impact	Block user/pattern, investigate
Attempted bypass, blocked	Log, monitor, update guardrails if novel
Canary failure	Investigate control failure, restore

Learning Loop¶

Every bypass is an opportunity to improve:

Bypass Learning Loop

Control Effectiveness by Bypass Type¶

Bypass Type	Most Effective Control	Why
Encoding tricks	Decoding + pattern matching	Deterministic, catches known encodings
Semantic rephrasing	ML classifiers + Judge	Requires understanding meaning
Multi-turn attacks	Conversation-level analysis	Per-message checks miss gradual escalation
Indirect injection	Tool output sanitisation + separate context	Treats retrieved data as untrusted
Intent declaration fraud	Audit + consequences + behavioural monitoring	Can't prevent lying; can detect and punish
Agentic loopholes	Infrastructure enforcement	Don't rely on agent to limit itself
Shadow AI	Good internal alternatives + DLP	Users go outside when inside is bad
Insider threats	Separation of duties + audit	No single person can both change and use
HITL bypass	Canary testing + reviewer metrics	Verify humans are actually reviewing

What We Cannot Control¶

Be honest about what's impossible:

Cannot Prevent	Can Only Mitigate
Novel attack techniques	Detect via anomaly, learn and adapt
Determined insider	Limit access, detect misuse, enforce consequences
User lying about intent	Verify where possible, monitor behaviour
Model emergent behaviour	Constrain at infrastructure level
Zero-day in model	Layered defences, quick response capability

Summary¶

Accept that bypasses will happen — design for detection and response, not just prevention
Layer your controls — attacker must defeat all layers, not just one
Enforce at infrastructure, not instruction — don't rely on AI to limit itself
Verify controls work — canary testing, adversarial testing, metrics
Make misuse costly — audit trails, consequences, attribution
Learn from every bypass — improve controls, update tests, share knowledge
Provide good alternatives — shadow AI happens when official AI is too restrictive

Control Checklist¶

Guardrail Bypasses¶

Multi-encoding detection and decoding
ML classifiers for semantic intent
Conversation-level analysis (not just per-message)
RAG content sanitisation
Regular adversarial testing
Canary injection for guardrail verification

Intent Declaration Bypasses¶

Structured declarations with verifiable fields
Manager attestation for sensitive access
Audit trail of all declarations
Periodic re-certification
Behavioural monitoring independent of declaration
Clear consequences for misuse

Agentic Bypasses¶

Infrastructure-level scope enforcement
Action validation independent of agent reasoning
Tool output sanitisation
Circuit breakers with hard limits
Approval workflows for significant actions
Outcome validation after completion

Architectural Bypasses¶

Shadow AI detection (network, DLP)
API gateway for all model access
Configuration change controls
Privileged access management
Good internal AI alternatives

Process Bypasses¶

HITL workload management
Canary testing for reviewers
Reviewer metrics and monitoring
Separation of duties for config
Configuration drift detection

AI Runtime Behaviour Security, 2026 (Jonathan Gill).