The Verification Gap¶
Why current AI safety approaches can't confirm ground truth — and what's emerging to fill the gap.
The Problem¶
Every mainstream approach to AI safety relies on one of three methods:
| Method | How It Works | Fundamental Flaw |
|---|---|---|
| Ask the model | Constitutional AI, Model Spec, system prompts | Model can still violate instructions |
| Pattern matching | Guardrails, keyword filters, regex | Novel attacks bypass known patterns |
| Ask another model | LLM-as-judge, evaluation frameworks | Second LLM has same vulnerabilities |
None of these independently verify whether an output is actually true.
They check compliance, not correctness. They detect known-bad patterns, not unknown-bad content. They evaluate style and safety, not factual accuracy.
This is the verification gap.
Why This Matters¶
Consider what each layer actually does:
User Input → [Guardrails] → LLM → [Judge] → Output
↓ ↓
"Does this "Is this
match bad response
patterns?" appropriate?"
↓ ↓
Neither asks: "Is this actually true?"
A confident hallucination passes every check:
- ✅ No harmful content detected
- ✅ Follows system prompt instructions
- ✅ Judge rates it as helpful and appropriate
- ❌ Contains fabricated facts
The Verification Spectrum¶
Not all verification is equal. Approaches vary in their independence from the LLM being verified:
Fully Dependent (No Independent Verification)¶
- Self-assessment: Model evaluates its own output
- LLM-as-judge: Another LLM evaluates output
- Risk: Same failure modes, correlated errors
Partially Independent¶
- Self-consistency: Multiple samples, check agreement
- Retrieval grounding: RAG with trusted sources
- Risk: Confident errors still pass; retrieval can fail
Fully Independent¶
- Formal verification: Mathematical proof against rules
- Knowledge graph lookup: Structured fact verification
- External API calls: Database/calculator validation
- Token-level uncertainty: Model's internal confidence signals
- Risk: Limited coverage; not all claims are verifiable
Current Solutions and Their Limits¶
Formal Verification (AWS Automated Reasoning)¶
What it does: Translates policy documents into formal logic, mathematically verifies LLM outputs against rules.
Claimed accuracy: Up to 99% for policy compliance verification.
Limitations: - Requires well-structured source documents - Complex/contradictory rules don't extract cleanly - Only verifies against your rules, not general truth - Adds latency to every request - Documents limited to 50,000 characters
Best for: Regulated industries with clear, documented policies (HR, finance, compliance).
Risk rating: ⬤⬤⬤⬤○ (High effectiveness for scoped domains)
Knowledge Graph Grounding¶
What it does: Maps claims to structured knowledge, verifies relationships exist in graph.
How it works: Extract subject-predicate-object from claim → Query knowledge graph → Verify relationship exists.
Limitations: - Coverage limited to what's in the graph - Struggles with temporal facts (things that change) - Entity disambiguation is hard - Complex reasoning chains degrade accuracy - Building/maintaining KG is expensive
Best for: Domain-specific factual verification (medical, legal, scientific).
Risk rating: ⬤⬤⬤○○ (Good for covered domains, poor for general knowledge)
Token-Level Detection (HaluGate, etc.)¶
What it does: Analyzes model internals (attention patterns, hidden states, token probabilities) to detect uncertainty.
Claimed accuracy: ~96% for determining if fact-checking is needed; varies for actual detection.
Limitations: - Requires white-box access to model - Needs context to verify against (RAG scenario) - Pre-classification only identifies what to check, not whether it's true - Novel model architectures may not transfer
Best for: Production RAG systems where you have source documents.
Risk rating: ⬤⬤⬤○○ (Fast and cheap, but context-dependent)
Self-Consistency Checking¶
What it does: Generate multiple responses, check if they agree.
How it works: Ask the same question 5 times with temperature > 0 → Measure semantic similarity → Flag inconsistent answers.
Limitations: - Confident hallucinations are consistently wrong - 5x latency and cost - Agreement ≠ correctness - Doesn't work for deterministic (temp=0) deployments
Best for: Catching uncertain responses, not verifying facts.
Risk rating: ⬤⬤○○○ (Catches some errors, misses confident ones)
LLM-as-Judge¶
What it does: Uses a second (often larger) model to evaluate outputs.
Claimed accuracy: 70-90% depending on task and judge model.
Limitations: - Same fundamental vulnerabilities as primary LLM - Can be fooled by same adversarial techniques - Adds significant latency and cost - Judge may have different biases, not fewer biases - "Grading your own homework with a different pencil"
Best for: Style, safety, and appropriateness — not factual accuracy.
Risk rating: ⬤⬤○○○ (Useful layer, but not independent verification)
Effectiveness Matrix¶
| Approach | Independence | Accuracy | Latency | Cost | Coverage |
|---|---|---|---|---|---|
| Formal Verification | ⬤⬤⬤⬤⬤ | ~99% | +200-500ms | $$ | Narrow (policy) |
| Knowledge Graph | ⬤⬤⬤⬤⬤ | ~92% | +50-100ms | $$$ (build) | Narrow (KG) |
| Token Detection | ⬤⬤⬤⬤○ | ~96% | +80-160ms | $ | RAG only |
| Self-Consistency | ⬤⬤⬤○○ | ~80% | +5x | 5x | All |
| LLM-as-Judge | ⬤○○○○ | ~80% | +500ms-5s | $$ | All |
| Pattern Guardrails | ⬤⬤⬤○○ | ~85% | +10-50ms | $ | Known patterns |
Legend: Independence = how much verification relies on LLM reasoning (5 = fully independent)
Implications for System Design¶
The Uncomfortable Truth¶
No single verification approach covers everything: - Formal verification only works for documented rules - Knowledge graphs only cover what's in them - Token detection only works with RAG context - Self-consistency misses confident errors - LLM-as-judge shares LLM vulnerabilities
Defense in Depth Isn't Enough¶
Stacking approaches with the same blind spots doesn't help:
Bad: Guardrails → LLM → LLM-Judge → LLM-Reviewer
(All rely on LLM reasoning — correlated failures)
Better: Guardrails → LLM → Formal Verify → KG Check → Human Sample
(Multiple independent verification methods)
Practical Recommendations¶
For Tier 4 (Critical) applications: - Require at least one fully independent verification method - Formal verification for rule compliance - Knowledge graph or external API for factual claims - Human review for edge cases
For Tier 3 (Significant) applications: - Token-level detection for RAG scenarios - Self-consistency for open-ended generation - Sampling-based human review
For Tier 2 (Moderate) applications: - LLM-as-judge is acceptable for most checks - Consider formal verification for high-stakes subsets
For Tier 1 (Minimal) applications: - Standard guardrails sufficient - LLM-as-judge for quality monitoring
The Path Forward¶
The industry is converging on a realization: verification requires multiple independent signals.
Emerging patterns include:
-
Hybrid pipelines: Combine cheap/fast checks (guardrails) with expensive/accurate checks (formal verification) based on claim type
-
Claim decomposition: Break responses into atomic claims, route each to appropriate verifier
-
Confidence-based escalation: Use model uncertainty to trigger deeper verification
-
Domain-specific verification: Build narrow, high-accuracy verifiers for specific claim types rather than trying to verify everything
The verification gap won't be closed by a single solution. It will be narrowed by thoughtful combination of independent verification methods, matched to the types of claims your application makes.
Key Takeaways¶
- Current safety layers don't verify truth — they verify compliance, safety, and style
- LLM-based verification shares LLM vulnerabilities — it's not independent
- Fully independent verification exists but has narrow coverage (formal logic, knowledge graphs)
- No single approach covers everything — hybrid pipelines are necessary
- Match verification to claim type — factual claims need different verification than policy compliance
The question isn't "which verification method should we use?" It's "which verification methods cover the claims our application makes?"
Related¶
- Behavioral Anomaly Detection — Aggregating signals to detect drift
- Why Guardrails Aren't Enough
- Judge Detects, Not Decides
- Current Solutions Reference
AI Runtime Behaviour Security, 2026 (Jonathan Gill).