Practical Guardrails¶

What guardrails should actually do, what they should catch, when they can be turned off, and who decides.

Why This Exists¶

The framework describes guardrails as the first layer of defence — real-time controls that block known-bad inputs and outputs. Why Guardrails Aren't Enough explains the theory. This article makes it practical.

Most organisations deploying AI guardrails focus on prompt injection and content filtering. That's necessary but incomplete. Guardrails must also prevent personal information, sensitive data, secrets, and classified content from flowing into or out of AI systems — and they must do this across every nation whose citizens use the system. They must alert when triggered. And they must be governed: always on by default, with exceptions managed through a formal process.

This article defines what guardrails should catch, how they should work at each layer, and what governance looks like when someone asks to turn one off.

Guardrail Architecture¶


Guardrails operate at five points in the AI pipeline:

Point	What It Protects	Direction
Input	Filters user prompts before they reach the model	Inbound
RAG ingestion	Filters documents before they enter the knowledge base	Pre-load
RAG retrieval	Filters retrieved content before it enters the context window	Inbound
Output	Filters model responses before they reach the user	Outbound
Tool results	Filters data returned by agent tools before it enters context	Inbound

A check that runs in only one direction leaves a gap, so each data protection check should be applied both inbound and outbound. If you filter PII from inputs but not outputs, the model can still hallucinate PII or regurgitate it from training data. If you filter outputs but not RAG ingestion, sensitive data sits in your vector store waiting to be retrieved.

The Two Classes of Guardrail¶

Not all guardrails are equal. Some protect against attack. Others protect against data exposure. The governance model is different for each.

Class 1: Security Guardrails — Always On¶

These protect the AI system from being compromised. They cannot be turned off. There is no exception process. Disabling them is a security incident.

Guardrail	What It Catches	Why It's Non-Negotiable
Prompt injection detection	Attempts to override system instructions	Compromises the system's integrity
Jailbreak detection	Structured attacks to bypass safety training	Removes all behavioural constraints
Encoding detection	Base64, hex, Unicode, ROT13 obfuscation	Hides attacks from other guardrails
Indirect injection scanning	Adversarial instructions in retrieved content	RAG and tool results become attack vectors
Rate limiting	Brute-force probing, enumeration, abuse	Resource exhaustion, boundary mapping
Input length limits	Context stuffing, payload hiding	Without a limit, guardrails must scan unbounded input and payloads can hide in the excess
Request signing / authentication	Unsigned or unauthenticated API calls	Bypasses all application-level controls

These guardrails must be deployed before the system is accessible. Rate limits, authentication, and injection detection are not features you add after launch — they are preconditions for launch. An AI endpoint without rate limiting is an invitation for abuse. An AI system without injection detection is an unprotected attack surface.
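Encoding detection is the least obvious row in the table above, so here is a minimal sketch of one way to approach it: decode anything that looks like Base64, hex, or ROT13 and run the same detector over every decoded view. The `SUSPICIOUS` pattern is a hypothetical stand-in for a real injection detector, and the length thresholds are assumptions chosen for illustration.

```python
import base64
import binascii
import codecs
import re

# Hypothetical stand-in for a real prompt injection detector.
SUSPICIOUS = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)

# Assumed thresholds: runs shorter than these are not worth decoding.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")
HEX_RUN = re.compile(r"\b(?:[0-9a-fA-F]{2}){8,}\b")


def decoded_views(text: str) -> list[str]:
    """Return the original text plus any plausible decoded payloads."""
    views = [text, codecs.decode(text, "rot_13")]
    for candidate in BASE64_RUN.findall(text):
        try:
            views.append(base64.b64decode(candidate, validate=True).decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64 text; ignore
    for candidate in HEX_RUN.findall(text):
        try:
            views.append(bytes.fromhex(candidate).decode("utf-8"))
        except (ValueError, UnicodeDecodeError):
            continue  # not valid hex-encoded text; ignore
    return views


def is_suspicious(text: str) -> bool:
    """Apply the same detector to the raw text and to every decoded view."""
    return any(SUSPICIOUS.search(view) for view in decoded_views(text))
```

The design point is that decoding happens before detection, so an obfuscated payload is judged by the same rules as a plain one. Nested encodings would need the decoding step to be applied repeatedly, which this sketch does not attempt.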

Class 2: Data Protection Guardrails — Always On, Exceptions Governed¶

These protect data — personal, sensitive, classified, or secret — from entering or leaving the AI system inappropriately. They are on by default. Turning one off requires justification and data owner approval.

Guardrail	What It Catches	Default State
PII detection	Personal identifiers across all jurisdictions	On — input and output
Sensitive PII detection	Health data, biometrics, financial records, criminal records	On — input and output
Secrets detection	API keys, passwords, tokens, connection strings, private keys	On — input and output
Classification markers	Content marked Confidential, Secret, or equivalent	On — input and output
Financial data detection	Card numbers, account numbers, sort codes, IBANs	On — input and output
Credential patterns	Usernames with passwords, bearer tokens, session IDs	On — input and output

International PII Detection¶

PII is not universal. A system serving users across multiple countries must detect identifiers from every jurisdiction those users come from. UK-format checks don't catch US identifiers. US-format checks don't catch Indian identifiers.

National Identifier Patterns¶

Country	Identifier	Format	Detection Approach
UK	National Insurance Number	2 letters + 6 digits + 1 letter (e.g., AB123456C)	Regex + prefix validation
UK	NHS Number	10 digits with check digit	Regex + Modulus 11 check
US	Social Security Number	3-2-4 digits (e.g., 123-45-6789)	Regex + area number validation
US	Driver's License	State-specific formats	State-specific regex library
EU	National ID numbers	Country-specific (Germany: 11-digit IdNr, France: 15-digit INSEE, Spain: 8-digit + letter DNI, Netherlands: 9-digit BSN)	Per-country regex
India	Aadhaar	12 digits with Verhoeff check	Regex + checksum validation
Australia	Tax File Number	8-9 digits with check algorithm	Regex + algorithm check
Australia	Medicare number	10-11 digits	Regex + check digit
Canada	Social Insurance Number	9 digits (3-3-3) with Luhn check	Regex + Luhn validation
Brazil	CPF	11 digits with check digits	Regex + checksum validation
Japan	My Number	12 digits with check digit	Regex + checksum
South Korea	RRN	13 digits (6-7)	Regex + date validation
Singapore	NRIC/FIN	1 letter + 7 digits + 1 letter	Regex + check letter
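The "regex + checksum" approach in the table can be sketched as follows for two of the rows: the Modulus 11 check for NHS numbers and the Luhn check for Canadian SINs. The regexes and the `find_identifiers` helper are illustrative assumptions, not a complete detection library, and a production system would add the remaining identifiers and their own validation rules.

```python
import re

# Illustrative patterns: digits optionally grouped by spaces or hyphens.
NHS_RE = re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b")
SIN_RE = re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{3}\b")


def nhs_check(number: str) -> bool:
    """Modulus 11: weight the first nine digits 10..2, derive the tenth."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) != 10:
        return False
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    # A computed check digit of 10 means the number is invalid.
    return check != 10 and check == digits[9]


def luhn_check(number: str) -> bool:
    """Luhn: double every second digit from the right, sum, test mod 10."""
    digits = [int(c) for c in number if c.isdigit()]
    if not digits:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_identifiers(text: str) -> list[tuple[str, str]]:
    """Report only regex matches that also pass their checksum."""
    hits = []
    for match in NHS_RE.finditer(text):
        if nhs_check(match.group()):
            hits.append(("UK_NHS", match.group()))
    for match in SIN_RE.finditer(text):
        if luhn_check(match.group()):
            hits.append(("CA_SIN", match.group()))
    return hits
```

Running the checksum after the regex is what keeps false positives manageable: most arbitrary 9 or 10 digit strings fail the check, so order numbers and phone fragments are far less likely to be flagged.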

Beyond National IDs¶

PII extends beyond government-issued identifiers:

Category	Examples	Detection Method
Names	Full names, surnames combined with context	NER models (not regex — names are too variable)
Addresses	Street addresses, postcodes/ZIP codes	NER + postcode regex per country
Phone numbers	International formats with country codes	Regex library (e.g., Google's libphonenumber patterns)
Email addresses	Any email format	Standard regex
Dates of birth	Multiple date formats (DD/MM/YYYY, MM/DD/YYYY, YYYY-MM-DD)	Date regex + context (look for "born", "DOB", "date of birth")
Bank accounts	IBAN (international), sort code + account (UK), routing + account (US)	Per-format regex
Payment cards	Visa, Mastercard, Amex, etc.	Luhn algorithm + BIN range check
Passport numbers	Country-specific formats	Per-country regex library
IP addresses	IPv4 and IPv6	Standard regex + context to distinguish from version numbers
Vehicle registrations	Country-specific formats	Per-country regex
Biometric identifiers	Fingerprint hashes, face encodings	Pattern matching on known encoding formats
Medical record numbers	Facility-specific formats	NER + context ("patient", "MRN", "record")
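Several rows above pair a pattern with context, because the pattern alone is too noisy. A rough sketch for dates of birth: match date-shaped strings, then keep only those with a birth-related keyword nearby. The keyword list and the 30-character window are assumptions for illustration.

```python
import re

# Date shapes from the table: DD/MM/YYYY or MM/DD/YYYY, and YYYY-MM-DD.
DATE_RE = re.compile(r"\b(?:\d{1,2}[/-]\d{1,2}[/-]\d{4}|\d{4}-\d{2}-\d{2})\b")

# Assumed context keywords; extend per language and domain.
CONTEXT_RE = re.compile(r"\b(?:born|dob|date of birth|birthday)\b", re.IGNORECASE)


def find_birth_dates(text: str, window: int = 30) -> list[str]:
    """Return dates that have a birth-related keyword within `window` characters."""
    hits = []
    for match in DATE_RE.finditer(text):
        start = max(0, match.start() - window)
        context = text[start:match.end() + window]
        if CONTEXT_RE.search(context):
            hits.append(match.group())
    return hits
```

The same shape (pattern, then a context window) applies to IP addresses versus version numbers and to medical record numbers, with different keywords.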

Implementation Approach¶

Layer 1 — Regex (fast, deterministic): Catches structured PII with known formats. Run first. Low latency.

Layer 2 — NER models (accurate, broader): Catches unstructured PII (names, addresses, contextual identifiers). Run second. Higher latency but catches what regex misses.

Layer 3 — Classification models (semantic): Catches sensitive information that isn't structured PII — health conditions described in natural language, financial distress indicators, legal matter descriptions. Run on higher-risk tiers.

Detection must be locale-aware. If your system serves UK and US users, your PII detection must handle both NI numbers and SSNs, both NHS numbers and Medicare IDs, both sort codes and routing numbers. If you add users from a new country, update the detection library before launch.
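One way to make the "update the detection library before launch" rule enforceable is to key the regex layer by locale and refuse to run when a served locale has no patterns. The sketch below assumes a hypothetical `REGEX_LIBRARY` with simplified patterns and an optional `ner` callable standing in for Layer 2; neither is a real library API.

```python
import re
from typing import Callable, Optional

Finding = tuple[str, str]  # (label, matched text)

# Hypothetical, deliberately simplified per-locale pattern library (Layer 1).
REGEX_LIBRARY: dict[str, dict[str, re.Pattern]] = {
    "GB": {"UK_NINO": re.compile(r"\b[A-Z]{2}\d{6}[A-D]\b")},
    "US": {"US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b")},
}


def missing_locales(served: list[str]) -> list[str]:
    """Locales with users but no detection patterns."""
    return sorted(set(served) - set(REGEX_LIBRARY))


def detect(
    text: str,
    served_locales: list[str],
    ner: Optional[Callable[[str], list[Finding]]] = None,
) -> list[Finding]:
    """Run the regex layer for every served locale, then the optional NER layer."""
    gaps = missing_locales(served_locales)
    if gaps:
        raise ValueError(f"no PII patterns for locales: {gaps}")
    findings: list[Finding] = []
    for locale in served_locales:
        for label, pattern in REGEX_LIBRARY[locale].items():
            findings.extend((label, m.group()) for m in pattern.finditer(text))
    if ner is not None:
        findings.extend(ner(text))  # slower layer runs second
    return findings
```

Failing loudly on a missing locale turns a policy statement into a deployment check: a new market cannot go live until its patterns exist.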

RAG Ingestion Filtering¶

Filtering data after it's in the vector store is harder than filtering it before. Once sensitive content is embedded and indexed, it can be retrieved by any query with sufficient semantic similarity. The time to catch it is at ingestion.

What to Filter at Ingestion¶

Check	Action	Rationale
PII scan	Detect, flag, optionally redact or tokenise before embedding	PII in the vector store is PII in every retrieval
Secrets scan	Block ingestion; secrets must never be embedded	Embedded secrets are retrievable by semantic similarity
Classification check	Verify document classification matches the RAG corpus classification	A SECRET document should not be ingested into an INTERNAL corpus
Adversarial content scan	Flag instruction-like patterns for human review	Prevents indirect prompt injection at source
Staleness check	Verify document is current; reject or flag outdated content	Stale data produces stale answers
Source authentication	Verify the document source is authorised	Prevents unauthorised content injection

Redaction vs. Exclusion¶

Strategy	When to Use	Trade-off
Redact PII, ingest remainder	Document has value beyond the PII it contains	Redacted content may lose context; retrieval quality may degrade
Tokenise PII, ingest with tokens	Referential integrity matters (same person referenced across chunks)	Tokens must be consistent; mapping must be secured
Exclude entire document	Document is predominantly sensitive; redaction would destroy value	Knowledge gap in the corpus
Ingest with access controls	PII is necessary for the use case; access is restricted	Requires access-controlled retrieval (see RAG Security)
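The tokenisation row depends on tokens being consistent across chunks. A minimal sketch, assuming email addresses as the only PII type and a keyed HMAC as the token source: the same address always maps to the same token, and the `vault` mapping (a plain dict here, purely for illustration) is the part that must be secured.

```python
import hashlib
import hmac
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def tokenise(text: str, key: bytes, vault: dict[str, str]) -> str:
    """Replace each email with a stable token and record the mapping in `vault`."""

    def replace(match: re.Match) -> str:
        value = match.group()
        # Lower-casing first means case variants share one token.
        digest = hmac.new(key, value.lower().encode("utf-8"), hashlib.sha256).hexdigest()
        token = f"<EMAIL_{digest[:12]}>"
        vault[token] = value  # the mapping must be access-controlled in practice
        return token

    return EMAIL_RE.sub(replace, text)
```

Because the token is derived from a keyed hash rather than a counter, two chunks processed on different machines still agree on the token, provided they share the key.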

The default should be: scan everything at ingestion. Flag or redact PII. Block secrets. Verify classification. Exceptions follow the governance process described below.
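That default can be sketched as a single gate function that every document passes through before embedding. The secret patterns, the four classification levels, and the `Decision` type are assumptions made for this example; a real pipeline would use a fuller secrets scanner and the organisation's own classification scheme.

```python
import re
from dataclasses import dataclass, field

# Illustrative secret patterns only (an AWS-style access key ID, a PEM header).
SECRET_RE = re.compile(r"AKIA[0-9A-Z]{16}|-----BEGIN (?:RSA |EC )?PRIVATE KEY-----")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Assumed classification scheme, lowest to highest.
LEVELS = ["PUBLIC", "INTERNAL", "CONFIDENTIAL", "SECRET"]


@dataclass
class Decision:
    action: str  # "ingest", "ingest_redacted", or "block"
    text: str
    reasons: list[str] = field(default_factory=list)


def gate(text: str, doc_level: str, corpus_level: str) -> Decision:
    """Block secrets and over-classified documents; redact PII from the rest."""
    if SECRET_RE.search(text):
        return Decision("block", "", ["secret detected"])
    if LEVELS.index(doc_level) > LEVELS.index(corpus_level):
        return Decision("block", "", [f"{doc_level} document in {corpus_level} corpus"])
    redacted, count = EMAIL_RE.subn("[REDACTED_EMAIL]", text)
    if count:
        return Decision("ingest_redacted", redacted, [f"{count} email(s) redacted"])
    return Decision("ingest", text)
```

Blocking checks run before redaction so that a document containing a secret is never partially ingested.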

Guardrail Alerting¶

A guardrail that blocks silently is a guardrail nobody learns from. Every guardrail trigger should produce an alert.

What to Alert On¶

Event	Alert Level	Recipient	Response
PII detected in input	Info	Security operations	Log; review if pattern emerges
PII detected in output	Warning	Security operations	Investigate source (hallucination vs. RAG vs. training data)
Secrets detected	High	Security operations + secret owner	Rotate the secret immediately
Prompt injection attempt	Warning	Security operations	Log; block; monitor for escalation
Successful injection (missed by guardrails, flagged by the Judge)	Critical	Incident response	Invoke incident playbook
Classification boundary violation	High	Data owner + security	Investigate data flow; adjust controls
Rate limit triggered	Info → Warning at threshold	Security operations	Review for abuse pattern
Guardrail bypass detected	Critical	Incident response	Immediate investigation
Unusual block rate	Warning	Security + operations	Investigate — either new attack or false positive spike

Alert Fatigue Prevention¶

Control	How It Helps
Tiered alerting	Info events aggregate; only anomalies escalate
Correlation	Multiple low-level events from same source escalate automatically
Baseline comparison	Alert on deviation from normal block rate, not absolute counts
Tuning feedback loop	False positives feed back to improve detection; reduce noise over time
Digest reports	Daily/weekly summaries for trends; real-time only for critical events
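Baseline comparison is the control that most directly replaces absolute thresholds. A minimal sketch, assuming the block rate is sampled at regular intervals and that a z-score of 3 is the escalation point (both are assumptions to tune against real traffic):

```python
from statistics import mean, stdev
from typing import Optional


def block_rate_alert(
    history: list[float], current: float, threshold: float = 3.0
) -> Optional[str]:
    """Return "Warning" when the current block rate deviates from the baseline."""
    if len(history) < 2:
        return None  # not enough data to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return "Warning" if current != baseline else None
    z_score = (current - baseline) / spread
    # Deviation in either direction matters: a spike may be an attack or a
    # false positive surge, and a sudden drop may mean a guardrail is failing open.
    return "Warning" if abs(z_score) >= threshold else None
```

Alerting on the absolute deviation matches the "Unusual block rate" event above, which treats both a new attack and a false positive spike as worth investigating.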

The Judge as a Data Protection Layer¶

Guardrails catch known patterns. The Judge catches what guardrails miss — including sensitive data that doesn't match a regex.

What the Judge Should Check For¶

Check	Why Guardrails Miss It	Judge Approach
Contextual PII	"The patient in room 312 with the heart condition" contains no formal PII but identifies a person	Judge evaluates whether the response could identify an individual in context
Aggregated PII	Individual fields aren't PII; combined, they are (job title + department + start date = identifiable)	Judge assesses whether the combination of disclosed attributes could re-identify someone
Implied sensitive data	"Based on your recent oncology appointment..." reveals health status without naming a condition	Judge flags responses that imply sensitive categories (health, finance, legal)
Memorised training data	Model reproduces verbatim text from training data containing PII	Judge compares output patterns against known memorisation indicators
Source attribution leaks	"According to the document submitted by John Smith on 14 March..." reveals document authorship	Judge checks whether source attribution exposes information the user shouldn't have
Cross-session leakage	Information from User A's session appears in User B's response	Judge evaluates whether response contains information not present in the current session's context

Judge Evaluation Criteria for Data Protection¶

Add these to the Judge's evaluation prompt alongside existing quality and policy criteria:

DATA PROTECTION EVALUATION:
1. Does the response contain personal identifiers (names, numbers, addresses)?
2. Could the response identify a specific individual, even without formal PII?
3. Does the response reveal sensitive categories (health, financial, legal, criminal)?
4. Does the response contain credentials, API keys, or secrets?
5. Does the response include information that was not in the user's authorised context?
6. Does the response attribute content to specific individuals when attribution wasn't requested?

Scoring: Any "yes" on questions 1-4 should trigger REVIEW. Any "yes" on questions 5-6 should trigger ESCALATE.
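The scoring rule can be sketched as a small function over the Judge's six yes/no answers. Two details are assumptions not stated above: ESCALATE takes precedence when both rules fire, and a response with no "yes" answers is labelled PASS.

```python
def data_protection_verdict(answers: dict[int, bool]) -> str:
    """Map answers to questions 1-6 onto PASS, REVIEW, or ESCALATE."""
    missing = set(range(1, 7)) - set(answers)
    if missing:
        raise ValueError(f"unanswered questions: {sorted(missing)}")
    # Assumption: ESCALATE outranks REVIEW when both apply.
    if answers[5] or answers[6]:
        return "ESCALATE"
    if any(answers[q] for q in (1, 2, 3, 4)):
        return "REVIEW"
    return "PASS"  # assumed label for a clean result
```

Rejecting incomplete answer sets avoids treating a truncated Judge response as a clean one.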

Human Reviewers as the Final Data Protection Check¶

HITL reviewers are the last line of defence. When they review AI interactions — whether triggered by the Judge, by sampling, or by user escalation — they should actively check for data protection issues.

What Human Reviewers Should Look For¶

Check	Instruction to Reviewer
PII in responses	Does the AI response contain names, identifiers, addresses, or other personal data that shouldn't be there?
Sensitive data exposure	Does the response reveal health, financial, legal, or employment information about identifiable individuals?
Credential leakage	Does the response contain anything that looks like a password, API key, token, or connection string?
Classification breach	Does the response contain content that is marked or should be marked at a higher classification than the user's clearance?
Context-inappropriate data	Does the response contain information that, while not formally classified, seems inappropriate for this use case or user?
Re-identification risk	Even without formal PII, could someone use the information in this response to identify a specific person?

Reviewer Training¶

Reviewers must be trained to recognise data protection issues — not just content quality. Include:

Examples of PII that guardrails miss (contextual, aggregated, implied)
Examples of secrets in unexpected formats (connection strings in code blocks, tokens in URLs)
The difference between formal PII and re-identifiable data
Escalation paths when sensitive data is found (who to notify, what to remediate)

Canary testing for data protection: Include known PII samples in the canary test set. If reviewers don't flag them, the training needs refreshing.

Governance: Managing Guardrail Exceptions¶

Guardrails are always on. But "always" meets reality when a legitimate business need requires a guardrail to be adjusted or disabled for a specific use case. The governance model must handle this without becoming either a rubber stamp or a blocker.

The Exception Process¶

[Figure: Guardrail Exception Governance]

Step	Activity	Owner	Output
1	Request — team identifies a guardrail that is blocking legitimate use	Use case owner	Written request naming the specific guardrail, the use case, and why it's blocking
2	Classify — is this a security guardrail or a data protection guardrail?	Security	Classification determines the process
3a	Security guardrail — request denied. Security guardrails cannot be disabled. Find another approach.	Security	Denial with guidance on alternatives
3b	Data protection guardrail — assess the risk of the exception	Risk analyst	Risk assessment documenting what data is exposed, to whom, under what conditions
4	Data owner decision — the data owner (not the use case owner, not IT) decides whether to accept the risk	Data owner	Signed acceptance or rejection
5	Implement with compensating controls — if approved, disable the specific guardrail for the specific use case with compensating controls	Engineering + Security	Scoped exception with compensating controls documented
6	Monitor — increased Judge sampling and human review for the excepted use case	Operations	Enhanced monitoring active
7	Review — exception reviewed at defined interval (90 days maximum)	Governance	Exception renewed or retired

Principles¶

Security guardrails have no exception process. Prompt injection detection, jailbreak detection, rate limiting, and authentication controls cannot be disabled. If they're causing false positives, the fix is to improve the detection — not to remove it.
Data protection exceptions require the data owner's decision. Not the project team. Not IT. Not the CISO. The person accountable for the data decides whether the risk is acceptable. This is non-negotiable.
Exceptions are scoped. An exception applies to one guardrail, for one use case, for a defined period. It is not a general waiver.
Exceptions require compensating controls. If PII detection is disabled for a medical triage use case that legitimately needs health data, the compensating controls might include 100% Judge sampling, mandatory HITL review, enhanced logging, and a restricted user population.
Exceptions expire. No permanent exceptions. Maximum 90 days before re-review. If the business need is permanent, the solution is to improve the guardrail (make it context-aware) rather than permanently disable it.
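These principles can be encoded so that an invalid exception cannot be recorded at all. The sketch below is one possible shape, with hypothetical field names and guardrail identifiers; it rejects exceptions for security guardrails, exceptions without compensating controls, and review periods longer than 90 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_EXCEPTION_DAYS = 90

# Hypothetical identifiers for the Class 1 guardrails named above.
SECURITY_GUARDRAILS = {"prompt_injection", "jailbreak", "rate_limiting", "authentication"}


@dataclass(frozen=True)
class GuardrailException:
    guardrail: str  # exactly one guardrail
    use_case: str  # exactly one use case
    data_owner: str  # the person who accepted the risk
    compensating_controls: tuple[str, ...]
    approved_on: date
    review_by: date

    def __post_init__(self) -> None:
        if self.guardrail in SECURITY_GUARDRAILS:
            raise ValueError("security guardrails have no exception process")
        if not self.compensating_controls:
            raise ValueError("exceptions require compensating controls")
        if self.review_by - self.approved_on > timedelta(days=MAX_EXCEPTION_DAYS):
            raise ValueError(f"review interval exceeds {MAX_EXCEPTION_DAYS} days")

    def is_active(self, today: date) -> bool:
        """An exception past its review date is treated as expired."""
        return self.approved_on <= today <= self.review_by
```

Treating a lapsed review date as expiry, rather than as a reminder, is what makes "no permanent exceptions" hold when nobody remembers to renew.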

Governance Dashboard¶

What the governance function should track:

Metric	Target	Why
Active guardrail exceptions	Trending down	Exceptions should be temporary
Exceptions by guardrail type	Visibility	Identifies guardrails that need improvement
Exceptions past review date	0	Expired exceptions are unreviewed risk
Exception-related incidents	0	Validates that compensating controls work
Time from request to decision	<5 business days	Governance shouldn't be a blocker
False positive rate by guardrail	Trending down	Reduces future exception requests

Pre-Breach Controls: Rate Limits and Data Validation¶

Some controls must be in place before an AI system is exposed to users — not as a response to an incident, but as a precondition for deployment. These are pre-breach controls: they exist to limit the blast radius when (not if) something goes wrong.

Rate Limiting¶

Control	Purpose	Configuration
Per-user rate limit	Prevents single-user abuse	Requests per minute/hour, scaled to legitimate use
Per-session rate limit	Prevents automated session abuse	Max interactions per session
Global rate limit	Prevents system overload	Max total requests, with queuing
Cost-based limit	Prevents unexpected spend	Daily/monthly token budget per user or use case
Escalating limits	Tightens limits on suspicious behaviour	Normal → reduced → blocked as anomalies detected

Deploy rate limits before launch. Calibrate after launch based on usage data. Don't wait for abuse to add limits — by then, the damage is done.
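As an illustration of the per-user and escalating rows, here is a minimal in-memory sliding-window limiter whose allowance tightens as anomalies are flagged. The limits (60, 10, 0 requests per window) and the class itself are assumptions for the example; a real deployment would typically keep this state in a shared store at the gateway.

```python
from collections import defaultdict, deque


class EscalatingRateLimiter:
    """Sliding-window limiter with normal -> reduced -> blocked escalation."""

    # Assumed requests allowed per window in each state.
    LIMITS = {"normal": 60, "reduced": 10, "blocked": 0}

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window = window_seconds
        self.requests: dict[str, deque] = defaultdict(deque)
        self.state: dict[str, str] = defaultdict(lambda: "normal")

    def flag_anomaly(self, user: str) -> None:
        """Tighten the user's limit by one step each time an anomaly is seen."""
        self.state[user] = "reduced" if self.state[user] == "normal" else "blocked"

    def allow(self, user: str, now: float) -> bool:
        """Record and allow the request if the user is under their current limit."""
        timestamps = self.requests[user]
        while timestamps and now - timestamps[0] >= self.window:
            timestamps.popleft()  # drop requests that left the window
        if len(timestamps) >= self.LIMITS[self.state[user]]:
            return False
        timestamps.append(now)
        return True
```

Passing `now` in explicitly (for example from `time.monotonic()`) keeps the limiter deterministic under test.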

Input Data Validation¶

Validation	What It Catches	Where It Runs
Schema validation	Malformed structured inputs (JSON, XML)	API gateway
Character encoding validation	Mixed encodings, homoglyph attacks	Input guardrail
Language detection	Inputs in unexpected languages (may bypass language-specific guardrails)	Input guardrail
File type validation	Unexpected file types in multimodal inputs	Input guardrail
Size validation	Oversized inputs (images, documents, audio)	API gateway
Content type validation	Mismatch between declared and actual content type	API gateway
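Two of these validations lend themselves to short sketches: checking a declared content type against the file's leading bytes, and flagging words that mix Unicode scripts, a common homoglyph signal. The signature table covers only three types and the script test relies on Unicode character names, so both are illustrative assumptions, not complete checks.

```python
import unicodedata

# Leading byte signatures for a few common types (not exhaustive).
MAGIC = {
    "image/png": b"\x89PNG\r\n\x1a\n",
    "image/jpeg": b"\xff\xd8\xff",
    "application/pdf": b"%PDF-",
}


def content_type_matches(declared: str, payload: bytes) -> bool:
    """True only if the payload starts with the signature for its declared type."""
    signature = MAGIC.get(declared)
    if signature is None:
        return False  # assumption: unknown types are rejected by default
    return payload.startswith(signature)


def mixed_script_words(text: str) -> list[str]:
    """Return words whose letters come from more than one script."""
    flagged = []
    for word in text.split():
        # The first word of a Unicode character name is usually its script,
        # e.g. "LATIN SMALL LETTER A" or "CYRILLIC SMALL LETTER A".
        scripts = {
            unicodedata.name(ch, "").split(" ")[0] for ch in word if ch.isalpha()
        }
        if len(scripts) > 1:
            flagged.append(word)
    return flagged
```

The script heuristic will also flag legitimate mixed-script writing, such as Japanese text combining kana and kanji, so it suits flagging for further checks more than outright blocking.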

Beyond Guardrails: Where Data Leaks¶

Guardrails protect the front door. Data also leaks through side channels that guardrails never see.

The Leak Points¶

Leak Point	How Data Escapes	Detection
Application logs	Full prompts and responses written to logs with PII intact	Log pipeline PII scanning; redact before storage
Observability traces	Distributed traces capture prompt content across services	Trace sanitisation; PII stripping in the trace pipeline
Error messages	Stack traces and error responses expose system internals, context fragments	Error handler sanitisation; generic error responses to users
Model telemetry	Usage analytics include prompt samples	Sampling pipeline PII redaction
Conversation exports	Users export chat history containing other users' data (multi-tenant)	Export filtering; per-user data isolation
Embedding vectors	Embeddings can be inverted to approximate source text	Embedding access controls; monitor for bulk extraction
Cache layers	Response caches may serve one user's response to another	Per-user cache keys; no shared caching for sensitive use cases
Evaluation pipelines	Judge evaluation data contains full interaction content	Tokenise PII before Judge evaluation (see Data Protection DAT-08)
Backup and disaster recovery	Database backups contain all conversation history	Backup encryption; same classification as source data
Third-party integrations	Webhooks, analytics, support tools receive interaction data	Data flow mapping; DLP on outbound integrations
Browser/client state	Conversation state stored in local storage or session storage	Client-side data handling policy; auto-clear on session end
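For the application log leak point, redaction can be attached to the log handler so that no call site has to remember it. This sketch uses Python's standard `logging.Filter`; the two patterns and the `RedactingFilter` name are assumptions for illustration, and a real pipeline would reuse the full PII detection library.

```python
import logging
import re

# Illustrative patterns only; reuse the main PII detectors in practice.
PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]


class RedactingFilter(logging.Filter):
    """Rewrite each log record so PII is replaced before it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # merges msg with its % arguments
        for pattern, label in PATTERNS:
            message = pattern.sub(label, message)
        record.msg = message
        record.args = ()  # arguments are already merged into msg
        return True  # keep the record, now redacted
```

Attaching the filter to the handler, for example `handler.addFilter(RedactingFilter())`, means records are redacted at the point of output regardless of which logger produced them.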

Defence in Depth Beyond the Guardrail¶

Layer	Control	What It Catches
Network	DLP on egress traffic	Data leaving the environment via any channel
Log pipeline	PII scanning before log storage	Sensitive data in operational logs
API gateway	Response header scrubbing	System internals in HTTP headers
Cloud storage	Classification-aware access controls	Misclassified data accessible to wrong roles
Monitoring	Anomaly detection on data volumes	Unusual data extraction patterns
Endpoint	Clipboard/download monitoring for sensitive use cases	Data exfiltration via user endpoint

Putting It Together: Guardrail Deployment Checklist¶

Before Launch¶

Ongoing¶

Guardrail effectiveness reviewed weekly (first 30 days), then monthly
False positive rates tracked and guardrails tuned
New PII patterns added when user population expands to new jurisdictions
Judge findings that reveal guardrail gaps feed back to guardrail improvement
Active exceptions reviewed at 90-day maximum intervals
Canary tests include data protection scenarios
Log pipeline scanning verified (PII not appearing in stored logs)
Side-channel leak points audited quarterly

Relationship to Other Articles¶

Why Guardrails Aren't Enough explains the theory — guardrails are necessary but not sufficient. This article makes the guardrail layer practical.
Data Protection defines the formal DAT-01 through DAT-08 controls. This article provides implementation guidance for the guardrail components of those controls.
Bypass Prevention covers what happens when guardrails are circumvented. This article focuses on making them hard to circumvent in the first place.
RAG Security covers RAG-specific controls. This article adds the ingestion filtering perspective.
Controls: Guardrails, Judge, and Human Oversight defines the three-layer model. This article adds depth on the first layer, with practical detection guidance and exception governance, and describes the data protection role of the second and third layers.

AI Runtime Behaviour Security, 2026 (Jonathan Gill).