Skip to content

Practical Guardrails

What guardrails should actually do, what they should catch, when they can be turned off, and who decides.


Why This Exists

The framework describes guardrails as the first layer of defence — real-time controls that block known-bad inputs and outputs. Why Guardrails Aren't Enough explains the theory. This article makes it practical.

Most organisations deploying AI guardrails focus on prompt injection and content filtering. That's necessary but incomplete. Guardrails must also prevent personal information, sensitive data, secrets, and classified content from flowing into or out of AI systems — and they must do this across every nation whose citizens use the system. They must alert when triggered. And they must be governed: always on by default, with exceptions managed through a formal process.

This article defines what guardrails should catch, how they should work at each layer, and what governance looks like when someone asks to turn one off.


Guardrail Architecture

Guardrail Architecture

Guardrails operate at five points in the AI pipeline:

Point What It Protects Direction
Input Filters user prompts before they reach the model Inbound
RAG ingestion Filters documents before they enter the knowledge base Pre-load
RAG retrieval Filters retrieved content before it enters the context window Inbound
Output Filters model responses before they reach the user Outbound
Tool results Filters data returned by agent tools before it enters context Inbound

Every guardrail works in both directions. If you filter PII from inputs but not outputs, the model can still hallucinate or regurgitate PII from training data. If you filter outputs but not RAG ingestion, sensitive data sits in your vector store waiting to be retrieved.


The Two Classes of Guardrail

Not all guardrails are equal. Some protect against attack. Others protect against data exposure. The governance model is different for each.

Class 1: Security Guardrails — Always On

These protect the AI system from being compromised. They cannot be turned off. There is no exception process. Disabling them is a security incident.

Guardrail What It Catches Why It's Non-Negotiable
Prompt injection detection Attempts to override system instructions Compromises the system's integrity
Jailbreak detection Structured attacks to bypass safety training Removes all behavioural constraints
Encoding detection Base64, hex, Unicode, ROT13 obfuscation Hides attacks from other guardrails
Indirect injection scanning Adversarial instructions in retrieved content RAG and tool results become attack vectors
Rate limiting Brute-force probing, enumeration, abuse Resource exhaustion, boundary mapping
Input length limits Context stuffing, payload hiding Forces guardrails to process unbounded input
Request signing / authentication Unsigned or unauthenticated API calls Bypasses all application-level controls

These guardrails must be deployed before the system is accessible. Rate limits, authentication, and injection detection are not features you add after launch — they are preconditions for launch. An AI endpoint without rate limiting is an invitation for abuse. An AI system without injection detection is an unprotected attack surface.

Class 2: Data Protection Guardrails — Always On, Exceptions Governed

These protect data — personal, sensitive, classified, or secret — from entering or leaving the AI system inappropriately. They are on by default. Turning one off requires justification and data owner approval.

Guardrail What It Catches Default State
PII detection Personal identifiers across all jurisdictions On — input and output
Sensitive PII detection Health data, biometrics, financial records, criminal records On — input and output
Secrets detection API keys, passwords, tokens, connection strings, private keys On — input and output
Classification markers Content marked Confidential, Secret, or equivalent On — input and output
Financial data detection Card numbers, account numbers, sort codes, IBANs On — input and output
Credential patterns Usernames with passwords, bearer tokens, session IDs On — input and output

International PII Detection

PII is not universal. A system serving users across multiple countries must detect identifiers from every jurisdiction those users come from. UK-format checks don't catch US identifiers. US-format checks don't catch Indian identifiers.

National Identifier Patterns

Country Identifier Format Detection Approach
UK National Insurance Number 2 letters + 6 digits + 1 letter (e.g., AB123456C) Regex + prefix validation
UK NHS Number 10 digits with check digit Regex + Modulus 11 check
US Social Security Number 3-2-4 digits (e.g., 123-45-6789) Regex + area number validation
US Driver's License State-specific formats State-specific regex library
EU National ID numbers Country-specific (Germany: 11-digit IdNr, France: 15-digit INSEE, Spain: 8-digit + letter DNI, Netherlands: 9-digit BSN) Per-country regex
India Aadhaar 12 digits with Verhoeff check Regex + checksum validation
Australia Tax File Number 8-9 digits with check algorithm Regex + algorithm check
Australia Medicare number 10-11 digits Regex + check digit
Canada Social Insurance Number 9 digits (3-3-3) with Luhn check Regex + Luhn validation
Brazil CPF 11 digits with check digits Regex + checksum validation
Japan My Number 12 digits with check digit Regex + checksum
South Korea RRN 13 digits (6-7) Regex + date validation
Singapore NRIC/FIN 1 letter + 7 digits + 1 letter Regex + check letter

Beyond National IDs

PII extends beyond government-issued identifiers:

Category Examples Detection Method
Names Full names, surnames combined with context NER models (not regex — names are too variable)
Addresses Street addresses, postcodes/ZIP codes NER + postcode regex per country
Phone numbers International formats with country codes Regex library (e.g., Google's libphonenumber patterns)
Email addresses Any email format Standard regex
Dates of birth Multiple date formats (DD/MM/YYYY, MM/DD/YYYY, YYYY-MM-DD) Date regex + context (look for "born", "DOB", "date of birth")
Bank accounts IBAN (international), sort code + account (UK), routing + account (US) Per-format regex
Payment cards Visa, Mastercard, Amex, etc. Luhn algorithm + BIN range check
Passport numbers Country-specific formats Per-country regex library
IP addresses IPv4 and IPv6 Standard regex + context to distinguish from version numbers
Vehicle registrations Country-specific formats Per-country regex
Biometric identifiers Fingerprint hashes, face encodings Pattern matching on known encoding formats
Medical record numbers Facility-specific formats NER + context ("patient", "MRN", "record")

Implementation Approach

Layer 1 — Regex (fast, deterministic): Catches structured PII with known formats. Run first. Low latency.

Layer 2 — NER models (accurate, broader): Catches unstructured PII (names, addresses, contextual identifiers). Run second. Higher latency but catches what regex misses.

Layer 3 — Classification models (semantic): Catches sensitive information that isn't structured PII — health conditions described in natural language, financial distress indicators, legal matter descriptions. Run on higher-risk tiers.

Detection must be locale-aware. If your system serves UK and US users, your PII detection must handle both NI numbers and SSNs, both NHS numbers and Medicare IDs, both sort codes and routing numbers. If you add users from a new country, update the detection library before launch.


RAG Ingestion Filtering

Filtering data after it's in the vector store is harder than filtering it before. Once sensitive content is embedded and indexed, it can be retrieved by any query with sufficient semantic similarity. The time to catch it is at ingestion.

What to Filter at Ingestion

Check Action Rationale
PII scan Detect, flag, optionally redact or tokenise before embedding PII in the vector store is PII in every retrieval
Secrets scan Block ingestion; secrets must never be embedded Embedded secrets are retrievable by semantic similarity
Classification check Verify document classification matches the RAG corpus classification A SECRET document should not be ingested into an INTERNAL corpus
Adversarial content scan Flag instruction-like patterns for human review Prevents indirect prompt injection at source
Staleness check Verify document is current; reject or flag outdated content Stale data produces stale answers
Source authentication Verify the document source is authorised Prevents unauthorised content injection

Redaction vs. Exclusion

Strategy When to Use Trade-off
Redact PII, ingest remainder Document has value beyond the PII it contains Redacted content may lose context; retrieval quality may degrade
Tokenise PII, ingest with tokens Referential integrity matters (same person referenced across chunks) Tokens must be consistent; mapping must be secured
Exclude entire document Document is predominantly sensitive; redaction would destroy value Knowledge gap in the corpus
Ingest with access controls PII is necessary for the use case; access is restricted Requires access-controlled retrieval (see RAG Security)

The default should be: scan everything at ingestion. Flag or redact PII. Block secrets. Verify classification. Exceptions follow the governance process described below.


Guardrail Alerting

A guardrail that blocks silently is a guardrail nobody learns from. Every guardrail trigger should produce an alert.

What to Alert On

Event Alert Level Recipient Response
PII detected in input Info Security operations Log; review if pattern emerges
PII detected in output Warning Security operations Investigate source (hallucination vs. RAG vs. training data)
Secrets detected High Security operations + secret owner Rotate the secret immediately
Prompt injection attempt Warning Security operations Log; block; monitor for escalation
Successful injection (guardrail passed, Judge flagged) Critical Incident response Invoke incident playbook
Classification boundary violation High Data owner + security Investigate data flow; adjust controls
Rate limit triggered Info → Warning at threshold Security operations Review for abuse pattern
Guardrail bypass detected Critical Incident response Immediate investigation
Unusual block rate Warning Security + operations Investigate — either new attack or false positive spike

Alert Fatigue Prevention

Control How It Helps
Tiered alerting Info events aggregate; only anomalies escalate
Correlation Multiple low-level events from same source escalate automatically
Baseline comparison Alert on deviation from normal block rate, not absolute counts
Tuning feedback loop False positives feed back to improve detection; reduce noise over time
Digest reports Daily/weekly summaries for trends; real-time only for critical events

The Judge as a Data Protection Layer

Guardrails catch known patterns. The Judge catches what guardrails miss — including sensitive data that doesn't match a regex.

What the Judge Should Check For

Check Why Guardrails Miss It Judge Approach
Contextual PII "The patient in room 312 with the heart condition" contains no formal PII but identifies a person Judge evaluates whether the response could identify an individual in context
Aggregated PII Individual fields aren't PII; combined, they are (job title + department + start date = identifiable) Judge assesses whether the combination of disclosed attributes could re-identify someone
Implied sensitive data "Based on your recent oncology appointment..." reveals health status without naming a condition Judge flags responses that imply sensitive categories (health, finance, legal)
Memorised training data Model reproduces verbatim text from training data containing PII Judge compares output patterns against known memorisation indicators
Source attribution leaks "According to the document submitted by John Smith on 14 March..." reveals document authorship Judge checks whether source attribution exposes information the user shouldn't have
Cross-session leakage Information from User A's session appears in User B's response Judge evaluates whether response contains information not present in the current session's context

Judge Evaluation Criteria for Data Protection

Add these to the Judge's evaluation prompt alongside existing quality and policy criteria:

DATA PROTECTION EVALUATION:
1. Does the response contain personal identifiers (names, numbers, addresses)?
2. Could the response identify a specific individual, even without formal PII?
3. Does the response reveal sensitive categories (health, financial, legal, criminal)?
4. Does the response contain credentials, API keys, or secrets?
5. Does the response include information that was not in the user's authorised context?
6. Does the response attribute content to specific individuals when attribution wasn't requested?

Scoring: Any "yes" on questions 1-4 should trigger REVIEW. Any "yes" on questions 5-6 should trigger ESCALATE.


Human Reviewers as the Final Data Protection Check

HITL reviewers are the last line of defence. When they review AI interactions — whether triggered by the Judge, by sampling, or by user escalation — they should actively check for data protection issues.

What Human Reviewers Should Look For

Check Instruction to Reviewer
PII in responses Does the AI response contain names, identifiers, addresses, or other personal data that shouldn't be there?
Sensitive data exposure Does the response reveal health, financial, legal, or employment information about identifiable individuals?
Credential leakage Does the response contain anything that looks like a password, API key, token, or connection string?
Classification breach Does the response contain content that is marked or should be marked at a higher classification than the user's clearance?
Context-inappropriate data Does the response contain information that, while not formally classified, seems inappropriate for this use case or user?
Re-identification risk Even without formal PII, could someone use the information in this response to identify a specific person?

Reviewer Training

Reviewers must be trained to recognise data protection issues — not just content quality. Include:

  • Examples of PII that guardrails miss (contextual, aggregated, implied)
  • Examples of secrets in unexpected formats (connection strings in code blocks, tokens in URLs)
  • The difference between formal PII and re-identifiable data
  • Escalation paths when sensitive data is found (who to notify, what to remediate)

Canary testing for data protection: Include known PII samples in the canary test set. If reviewers don't flag them, the training needs refreshing.


Governance: Managing Guardrail Exceptions

Guardrails are always on. But "always" meets reality when a legitimate business need requires a guardrail to be adjusted or disabled for a specific use case. The governance model must handle this without becoming either a rubber stamp or a blocker.

The Exception Process

Guardrail Exception Governance

Step Activity Owner Output
1 Request — team identifies a guardrail that is blocking legitimate use Use case owner Written request naming the specific guardrail, the use case, and why it's blocking
2 Classify — is this a security guardrail or a data protection guardrail? Security Classification determines the process
3a Security guardrail — request denied. Security guardrails cannot be disabled. Find another approach. Security Denial with guidance on alternatives
3b Data protection guardrail — assess the risk of the exception Risk analyst Risk assessment documenting what data is exposed, to whom, under what conditions
4 Data owner decision — the data owner (not the use case owner, not IT) decides whether to accept the risk Data owner Signed acceptance or rejection
5 Implement with compensating controls — if approved, disable the specific guardrail for the specific use case with compensating controls Engineering + Security Scoped exception with compensating controls documented
6 Monitor — increased Judge sampling and human review for the excepted use case Operations Enhanced monitoring active
7 Review — exception reviewed at defined interval (90 days maximum) Governance Exception renewed or retired

Principles

  • Security guardrails have no exception process. Prompt injection detection, jailbreak detection, rate limiting, and authentication controls cannot be disabled. If they're causing false positives, the fix is to improve the detection — not to remove it.
  • Data protection exceptions require the data owner's decision. Not the project team. Not IT. Not the CISO. The person accountable for the data decides whether the risk is acceptable. This is non-negotiable.
  • Exceptions are scoped. An exception applies to one guardrail, for one use case, for a defined period. It is not a general waiver.
  • Exceptions require compensating controls. If PII detection is disabled for a medical triage use case that legitimately needs health data, the compensating controls might include 100% Judge sampling, mandatory HITL review, enhanced logging, and a restricted user population.
  • Exceptions expire. No permanent exceptions. Maximum 90 days before re-review. If the business need is permanent, the solution is to improve the guardrail (make it context-aware) rather than permanently disable it.

Governance Dashboard

What the governance function should track:

Metric Target Why
Active guardrail exceptions Trending down Exceptions should be temporary
Exceptions by guardrail type Visibility Identifies guardrails that need improvement
Exceptions past review date 0 Expired exceptions are unreviewed risk
Exception-related incidents 0 Validates that compensating controls work
Time from request to decision <5 business days Governance shouldn't be a blocker
False positive rate by guardrail Trending down Reduces future exception requests

Pre-Breach Controls: Rate Limits and Data Validation

Some controls must be in place before an AI system is exposed to users — not as a response to an incident, but as a precondition for deployment. These are pre-breach controls: they exist to limit the blast radius when (not if) something goes wrong.

Rate Limiting

Control Purpose Configuration
Per-user rate limit Prevents single-user abuse Requests per minute/hour, scaled to legitimate use
Per-session rate limit Prevents automated session abuse Max interactions per session
Global rate limit Prevents system overload Max total requests, with queuing
Cost-based limit Prevents unexpected spend Daily/monthly token budget per user or use case
Escalating limits Tightens limits on suspicious behaviour Normal → reduced → blocked as anomalies detected

Deploy rate limits before launch. Calibrate after launch based on usage data. Don't wait for abuse to add limits — by then, the damage is done.

Input Data Validation

Validation What It Catches Where It Runs
Schema validation Malformed structured inputs (JSON, XML) API gateway
Character encoding validation Mixed encodings, homoglyph attacks Input guardrail
Language detection Inputs in unexpected languages (may bypass language-specific guardrails) Input guardrail
File type validation Unexpected file types in multimodal inputs Input guardrail
Size validation Oversized inputs (images, documents, audio) API gateway
Content type validation Mismatch between declared and actual content type API gateway

Beyond Guardrails: Where Data Leaks

Guardrails protect the front door. Data also leaks through side channels that guardrails never see.

The Leak Points

Leak Point How Data Escapes Detection
Application logs Full prompts and responses written to logs with PII intact Log pipeline PII scanning; redact before storage
Observability traces Distributed traces capture prompt content across services Trace sanitisation; PII stripping in the trace pipeline
Error messages Stack traces and error responses expose system internals, context fragments Error handler sanitisation; generic error responses to users
Model telemetry Usage analytics include prompt samples Sampling pipeline PII redaction
Conversation exports Users export chat history containing other users' data (multi-tenant) Export filtering; per-user data isolation
Embedding vectors Embeddings can be inverted to approximate source text Embedding access controls; monitor for bulk extraction
Cache layers Response caches may serve one user's response to another Per-user cache keys; no shared caching for sensitive use cases
Evaluation pipelines Judge evaluation data contains full interaction content Tokenise PII before Judge evaluation (see Data Protection DAT-08)
Backup and disaster recovery Database backups contain all conversation history Backup encryption; same classification as source data
Third-party integrations Webhooks, analytics, support tools receive interaction data Data flow mapping; DLP on outbound integrations
Browser/client state Conversation state stored in local storage or session storage Client-side data handling policy; auto-clear on session end

Defence in Depth Beyond the Guardrail

Layer Control What It Catches
Network DLP on egress traffic Data leaving the environment via any channel
Log pipeline PII scanning before log storage Sensitive data in operational logs
API gateway Response header scrubbing System internals in HTTP headers
Cloud storage Classification-aware access controls Misclassified data accessible to wrong roles
Monitoring Anomaly detection on data volumes Unusual data extraction patterns
Endpoint Clipboard/download monitoring for sensitive use cases Data exfiltration via user endpoint

Putting It Together: Guardrail Deployment Checklist

Before Launch

  • Security guardrails active: injection detection, encoding detection, rate limiting, authentication
  • Data protection guardrails active: PII detection (all relevant jurisdictions), secrets detection, classification checking
  • RAG ingestion pipeline filters for PII, secrets, adversarial content, and classification mismatches
  • Input and output guardrails both active (bidirectional)
  • Alerting configured for all guardrail trigger events
  • Rate limits calibrated and deployed
  • Input validation active at API gateway
  • Log pipeline PII scanning active
  • Judge evaluation criteria include data protection checks
  • HITL reviewer training includes data protection recognition
  • Guardrail exception process documented and communicated

Ongoing

  • Guardrail effectiveness reviewed weekly (first 30 days), then monthly
  • False positive rates tracked and guardrails tuned
  • New PII patterns added when user population expands to new jurisdictions
  • Judge findings that reveal guardrail gaps feed back to guardrail improvement
  • Active exceptions reviewed at 90-day maximum intervals
  • Canary tests include data protection scenarios
  • Log pipeline scanning verified (PII not appearing in stored logs)
  • Side-channel leak points audited quarterly

  • Why Guardrails Aren't Enough explains the theory — guardrails are necessary but not sufficient. This article makes the guardrail layer practical.
  • Data Protection defines the formal DAT-01 through DAT-08 controls. This article provides implementation guidance for the guardrail components of those controls.
  • Bypass Prevention covers what happens when guardrails are circumvented. This article focuses on making them hard to circumvent in the first place.
  • RAG Security covers RAG-specific controls. This article adds the ingestion filtering perspective.
  • Controls: Guardrails, Judge, and Human Oversight defines the three-layer model. This article provides depth on the first layer with practical detection guidance and governance for the second and third layers' data protection role.

AI Runtime Behaviour Security, 2026 (Jonathan Gill).