The Verification Gap¶

Why current AI safety approaches can't confirm ground truth — and what's emerging to fill the gap.

The Problem¶

Every mainstream approach to AI safety relies on one of three methods:

Method	How It Works	Fundamental Flaw
Ask the model	Constitutional AI, Model Spec, system prompts	Model can still violate instructions
Pattern matching	Guardrails, keyword filters, regex	Novel attacks bypass known patterns
Ask another model	LLM-as-judge, evaluation frameworks	Second LLM has same vulnerabilities

None of these independently verify whether an output is actually true.

They check compliance, not correctness. They detect known-bad patterns, not unknown-bad content. They evaluate style and safety, not factual accuracy.

This is the verification gap.

Why This Matters¶

Consider what each layer actually does:

User Input → [Guardrails] → LLM → [Judge] → Output
                  ↓              ↓
            "Does this         "Is this
             match bad          response
             patterns?"         appropriate?"
                  ↓              ↓
              Neither asks: "Is this actually true?"

A confident hallucination passes every check:
- ✅ No harmful content detected
- ✅ Follows system prompt instructions
- ✅ Judge rates it as helpful and appropriate
- ❌ Contains fabricated facts
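The checklist above can be illustrated with a small sketch. Everything here is a hypothetical stand-in: the blocklist, the word limit used as a proxy for "follows the system prompt", and the tone-only `judge_ok` function are invented for illustration and are far simpler than real guardrails or an LLM judge. The point is only that none of the three functions has access to ground truth.

```python
import re

# Hypothetical blocklist; real guardrail rule sets are much larger.
BLOCKED_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (r"\bpassword\b", r"\bexploit\b")]

def guardrail_ok(text: str) -> bool:
    """Pattern check: passes when no known-bad pattern matches."""
    return not any(p.search(text) for p in BLOCKED_PATTERNS)

def follows_prompt(text: str, max_words: int = 40) -> bool:
    """Compliance check: a crude proxy (length limit) for 'follows instructions'."""
    return len(text.split()) <= max_words

def judge_ok(text: str) -> bool:
    """Stand-in for an LLM judge that rates tone; it cannot see ground truth."""
    return not any(word in text.lower() for word in ("stupid", "idiot"))

def run_checks(text: str) -> dict[str, bool]:
    """Run every layer and report which ones pass."""
    return {
        "guardrail": guardrail_ok(text),
        "prompt": follows_prompt(text),
        "judge": judge_ok(text),
    }

# Fluent, polite, and false: the Eiffel Tower was completed in 1889.
results = run_checks("The Eiffel Tower was completed in 1925.")
print(results)  # every check passes
```

All three layers return `True` for a fabricated date, because each one asks about patterns, compliance, or tone rather than correctness.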

The Verification Spectrum¶

Not all verification is equal. Approaches vary in their independence from the LLM being verified:

[Figure: Verification Spectrum]

Fully Dependent (No Independent Verification)¶

Self-assessment: Model evaluates its own output
LLM-as-judge: Another LLM evaluates output
Risk: Same failure modes, correlated errors

Partially Independent¶

Self-consistency: Multiple samples, check agreement
Retrieval grounding: RAG with trusted sources
Risk: Confident errors still pass; retrieval can fail

Fully Independent¶

Formal verification: Mathematical proof against rules
Knowledge graph lookup: Structured fact verification
External API calls: Database/calculator validation
Token-level uncertainty: Model's internal confidence signals
Risk: Limited coverage; not all claims are verifiable

Current Solutions and Their Limits¶

Formal Verification (AWS Automated Reasoning)¶

What it does: Translates policy documents into formal logic, mathematically verifies LLM outputs against rules.

Claimed accuracy: Up to 99% for policy compliance verification.

Limitations:
- Requires well-structured source documents
- Complex/contradictory rules don't extract cleanly
- Only verifies against your rules, not general truth
- Adds latency to every request
- Documents limited to 50,000 characters

Best for: Regulated industries with clear, documented policies (HR, finance, compliance).

Risk rating: ⬤⬤⬤⬤○ (High effectiveness for scoped domains)
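
To make the formal-verification idea above concrete, here is a minimal sketch of checking a structured claim against a rule encoded in code. It is not the AWS Automated Reasoning API and it uses no theorem prover. The leave policy, `LeaveClaim`, `policy_eligible`, and `verify` are hypothetical names invented for illustration, and the step that turns a free-text answer into a structured claim is assumed to happen upstream.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LeaveClaim:
    """Structured form of an LLM answer (extraction assumed done upstream)."""
    tenure_months: int
    full_time: bool
    asserted_eligible: bool  # what the answer told the user

def policy_eligible(tenure_months: int, full_time: bool) -> bool:
    """Hypothetical rule: full-time staff need 12 months of tenure, part-time 24."""
    required = 12 if full_time else 24
    return tenure_months >= required

def verify(claim: LeaveClaim) -> str:
    """Compare what the answer asserted with what the rule entails."""
    entailed = policy_eligible(claim.tenure_months, claim.full_time)
    return "VALID" if claim.asserted_eligible == entailed else "INVALID"

# The model told a full-time employee with 8 months of tenure that they qualify.
answer = LeaveClaim(tenure_months=8, full_time=True, asserted_eligible=True)
print(verify(answer))  # INVALID
```

The result only says whether the answer agrees with the encoded rule. If the rule was extracted incorrectly from the policy document, the check will confidently confirm the wrong thing.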

Knowledge Graph Grounding¶

What it does: Maps claims to structured knowledge, verifies relationships exist in graph.

How it works: Extract subject-predicate-object from claim → Query knowledge graph → Verify relationship exists.

Limitations:
- Coverage limited to what's in the graph
- Struggles with temporal facts (things that change)
- Entity disambiguation is hard
- Complex reasoning chains degrade accuracy
- Building/maintaining KG is expensive

Best for: Domain-specific factual verification (medical, legal, scientific).

Risk rating: ⬤⬤⬤○○ (Good for covered domains, poor for general knowledge)
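
The extract, query, verify flow described above can be sketched with a toy in-memory graph. This is an illustration under strong assumptions: the graph is a Python set of triples, and the regex extractor only understands sentences of the form "X is the capital of Y". `KNOWLEDGE_GRAPH`, `extract_triple`, and `verify_claim` are hypothetical names, not part of any real knowledge graph product.

```python
import re
from typing import Optional

Triple = tuple[str, str, str]

# Toy knowledge graph stored as (subject, predicate, object) triples.
KNOWLEDGE_GRAPH: set[Triple] = {
    ("Paris", "capital_of", "France"),
    ("Canberra", "capital_of", "Australia"),
}

# Toy extractor: only understands "<X> is the capital of <Y>".
CAPITAL_PATTERN = re.compile(r"^(?P<subj>[\w ]+?) is the capital of (?P<obj>[\w ]+?)\.?$")

def extract_triple(claim: str) -> Optional[Triple]:
    """Step 1: extract subject-predicate-object, or None if the pattern is unknown."""
    match = CAPITAL_PATTERN.match(claim.strip())
    if match is None:
        return None
    return (match.group("subj"), "capital_of", match.group("obj"))

def verify_claim(claim: str) -> str:
    """Steps 2 and 3: query the graph and report whether the relationship exists."""
    triple = extract_triple(claim)
    if triple is None:
        return "unverifiable"  # outside what the extractor covers
    return "supported" if triple in KNOWLEDGE_GRAPH else "not_found"

print(verify_claim("Canberra is the capital of Australia."))  # supported
print(verify_claim("Sydney is the capital of Australia."))    # not_found
print(verify_claim("Sydney has a famous opera house."))       # unverifiable
```

The third result shows the coverage limit in miniature: a true statement comes back as `unverifiable` because neither the extractor nor the graph knows anything about it.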

Token-Level Detection (HaluGate, etc.)¶

What it does: Analyzes model internals (attention patterns, hidden states, token probabilities) to detect uncertainty.

Claimed accuracy: ~96% for determining if fact-checking is needed; varies for actual detection.

Limitations:
- Requires white-box access to model
- Needs context to verify against (RAG scenario)
- Pre-classification only identifies what to check, not whether it's true
- Novel model architectures may not transfer

Best for: Production RAG systems where you have source documents.

Risk rating: ⬤⬤⬤○○ (Fast and cheap, but context-dependent)
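
As a rough illustration of using token probabilities as an uncertainty signal, the sketch below flags tokens whose probability falls under a threshold. It assumes the serving stack exposes per-token log-probabilities, and it is not how HaluGate or any specific detector works. The function names, the 0.5 threshold, and the example probabilities are all invented for illustration.

```python
import math

def flag_uncertain_tokens(tokens, logprobs, min_prob=0.5):
    """Return the tokens whose probability falls below min_prob."""
    return [tok for tok, lp in zip(tokens, logprobs) if math.exp(lp) < min_prob]

def needs_fact_check(logprobs, min_prob=0.5):
    """Pre-classification: decide whether the span deserves a deeper check."""
    return any(math.exp(lp) < min_prob for lp in logprobs)

# Illustrative probabilities: the model is sure of the sentence frame
# but unsure of the year it produced.
tokens = ["The", "bridge", "opened", "in", "1932"]
probs = [0.98, 0.91, 0.95, 0.97, 0.31]
logprobs = [math.log(p) for p in probs]

print(flag_uncertain_tokens(tokens, logprobs))  # ['1932']
print(needs_fact_check(logprobs))               # True
```

A low probability signals uncertainty, not falsity. A token produced with high probability can still be wrong, which is why this kind of signal decides what to check and leaves the actual check to something else.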

Self-Consistency Checking¶

What it does: Generate multiple responses, check if they agree.

How it works: Ask the same question 5 times with temperature > 0 → Measure semantic similarity → Flag inconsistent answers.

Limitations:
- Confident hallucinations are consistently wrong
- 5x latency and cost
- Agreement ≠ correctness
- Doesn't work for deterministic (temp=0) deployments

Best for: Catching uncertain responses, not verifying facts.

Risk rating: ⬤⬤○○○ (Catches some errors, misses confident ones)
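
The sample, compare, flag loop described above can be sketched as follows. The samples are hard-coded where a real system would call the model five times with temperature > 0. `difflib.SequenceMatcher` measures lexical similarity and is used here only as a crude stand-in for semantic similarity. The 0.8 threshold and the function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean

def similarity(a: str, b: str) -> float:
    """Lexical similarity in [0, 1]; a stand-in for a semantic measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def consistency_score(samples: list[str]) -> float:
    """Mean pairwise similarity across all sampled answers."""
    return mean(similarity(a, b) for a, b in combinations(samples, 2))

def is_inconsistent(samples: list[str], threshold: float = 0.8) -> bool:
    """Flag the answer set when the samples disagree too much."""
    return consistency_score(samples) < threshold

# Five samples for the same question; a real system would query the model.
uncertain = ["1912", "1915", "around 1920", "1908", "1912"]
confidently_wrong = ["1925", "1925", "1925", "1925", "1925"]

print(is_inconsistent(uncertain))          # True: samples disagree
print(is_inconsistent(confidently_wrong))  # False: samples agree, right or wrong
```

The second call is the limitation in action: five identical wrong answers score as perfectly consistent, so agreement alone says nothing about correctness.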

LLM-as-Judge¶

What it does: Uses a second (often larger) model to evaluate outputs.

Claimed accuracy: 70-90% depending on task and judge model.

Limitations:
- Same fundamental vulnerabilities as primary LLM
- Can be fooled by same adversarial techniques
- Adds significant latency and cost
- Judge may have different biases, not fewer biases
- "Grading your own homework with a different pencil"

Best for: Style, safety, and appropriateness — not factual accuracy.

Risk rating: ⬤⬤○○○ (Useful layer, but not independent verification)

Effectiveness Matrix¶

Approach	Independence	Accuracy	Latency	Cost	Coverage
Formal Verification	⬤⬤⬤⬤⬤	~99%	+200-500ms	$$	Narrow (policy)
Knowledge Graph	⬤⬤⬤⬤⬤	~92%	+50-100ms	$$$ (build)	Narrow (KG)
Token Detection	⬤⬤⬤⬤○	~96%	+80-160ms	$	RAG only
Self-Consistency	⬤⬤⬤○○	~80%	+5x	5x	All
LLM-as-Judge	⬤○○○○	~80%	+500ms-5s	$$	All
Pattern Guardrails	⬤⬤⬤○○	~85%	+10-50ms	$	Known patterns

Legend: Independence = how little the verification relies on LLM reasoning (5 dots = fully independent)

Implications for System Design¶

The Uncomfortable Truth¶

No single verification approach covers everything:
- Formal verification only works for documented rules
- Knowledge graphs only cover what's in them
- Token detection only works with RAG context
- Self-consistency misses confident errors
- LLM-as-judge shares LLM vulnerabilities

Defense in Depth Isn't Enough¶

Stacking approaches with the same blind spots doesn't help:

Bad: Guardrails → LLM → LLM-Judge → LLM-Reviewer
     (All rely on LLM reasoning — correlated failures)

Better: Guardrails → LLM → Formal Verify → KG Check → Human Sample
        (Multiple independent verification methods)
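
A short calculation shows why correlation matters. Suppose each layer misses a given error with some probability. If the layers fail independently, the chance that all of them miss is the product of the miss rates. If they share the same blind spot, the chance that all of them miss can be as high as the smallest individual miss rate. The 20% figure below is a hypothetical number chosen for illustration, not a measurement, and the function names are invented.

```python
from math import prod

def joint_miss_independent(miss_rates: list[float]) -> float:
    """P(all layers miss) when failures are statistically independent."""
    return prod(miss_rates)

def joint_miss_fully_correlated(miss_rates: list[float]) -> float:
    """Worst case: blind spots are nested, so stacking adds nothing
    beyond the single best layer."""
    return min(miss_rates)

# Three layers, each missing 20% of errors (illustrative figure).
stack = [0.2, 0.2, 0.2]

print(round(joint_miss_independent(stack), 3))  # 0.008
print(joint_miss_fully_correlated(stack))       # 0.2
```

Under these assumptions, three independent layers let through about 0.8% of errors, while three fully correlated layers let through the same 20% as one layer alone. Real pipelines sit somewhere between the two extremes.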

Practical Recommendations¶

For Tier 4 (Critical) applications:
- Require at least one fully independent verification method
- Formal verification for rule compliance
- Knowledge graph or external API for factual claims
- Human review for edge cases

For Tier 3 (Significant) applications:
- Token-level detection for RAG scenarios
- Self-consistency for open-ended generation
- Sampling-based human review

For Tier 2 (Moderate) applications:
- LLM-as-judge is acceptable for most checks
- Consider formal verification for high-stakes subsets

For Tier 1 (Minimal) applications:
- Standard guardrails sufficient
- LLM-as-judge for quality monitoring
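
One way to keep these recommendations enforceable is to encode them as configuration and check each pipeline against it. The sketch below is a simplification: the check names are hypothetical, conditional items such as "for RAG scenarios" are treated as always required, and `FULLY_INDEPENDENT` lists only the methods this sketch assumes count as independent.

```python
from enum import IntEnum

class Tier(IntEnum):
    MINIMAL = 1
    MODERATE = 2
    SIGNIFICANT = 3
    CRITICAL = 4

# Hypothetical check names; the mapping paraphrases the recommendations above.
BASELINE_CHECKS: dict[Tier, set[str]] = {
    Tier.CRITICAL: {"formal_verification", "knowledge_graph_or_api", "human_review"},
    Tier.SIGNIFICANT: {"token_level_detection", "self_consistency", "sampled_human_review"},
    Tier.MODERATE: {"llm_as_judge"},
    Tier.MINIMAL: {"guardrails", "llm_as_judge"},
}

# Methods treated as fully independent of LLM reasoning in this sketch.
FULLY_INDEPENDENT = {"formal_verification", "knowledge_graph_or_api"}

def missing_checks(tier: Tier, configured: set[str]) -> set[str]:
    """Return the baseline checks for the tier that the pipeline lacks."""
    return BASELINE_CHECKS[tier] - configured

def meets_independence_rule(tier: Tier, configured: set[str]) -> bool:
    """Tier 4 needs at least one fully independent method; other tiers do not."""
    if tier is not Tier.CRITICAL:
        return True
    return bool(configured & FULLY_INDEPENDENT)

pipeline = {"guardrails", "llm_as_judge", "human_review"}
print(sorted(missing_checks(Tier.CRITICAL, pipeline)))
# ['formal_verification', 'knowledge_graph_or_api']
print(meets_independence_rule(Tier.CRITICAL, pipeline))  # False
```

A check like this could run in CI or at deploy time, so that a Tier 4 application cannot ship with only LLM-based verification.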

The Path Forward¶

The industry is converging on a realization: verification requires multiple independent signals.

Emerging patterns include:

Hybrid pipelines: Combine cheap/fast checks (guardrails) with expensive/accurate checks (formal verification) based on claim type
Claim decomposition: Break responses into atomic claims, route each to appropriate verifier
Confidence-based escalation: Use model uncertainty to trigger deeper verification
Domain-specific verification: Build narrow, high-accuracy verifiers for specific claim types rather than trying to verify everything
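
Three of these patterns (claim decomposition, routing by claim type, and confidence-based escalation) fit together naturally. The sketch below is a minimal illustration under stated assumptions: the sentence split is a naive stand-in for real claim decomposition, the `kind` label and `confidence` value are assumed to come from upstream components, and the verifier names and the 0.7 threshold are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    text: str
    kind: str          # "policy", "fact", or "other" (assumed to come from a classifier)
    confidence: float  # model confidence in [0, 1] (assumed available)

def decompose(response: str) -> list[str]:
    """Naive claim decomposition: treat each sentence as one claim."""
    return [s.strip() for s in response.split(".") if s.strip()]

def route(claim: Claim, escalate_below: float = 0.7) -> list[str]:
    """Pick verifiers by claim type, then escalate when confidence is low."""
    verifiers = ["guardrails"]  # cheap check applied to everything
    if claim.kind == "policy":
        verifiers.append("formal_verification")
    elif claim.kind == "fact":
        verifiers.append("knowledge_graph")
    if claim.confidence < escalate_below:
        verifiers.append("human_review")  # confidence-based escalation
    return verifiers

print(decompose("Refunds are allowed within 30 days. Paris is the capital of France."))
print(route(Claim("Refunds are allowed within 30 days", "policy", 0.9)))
print(route(Claim("Paris is the capital of France", "fact", 0.4)))
```

Routing per claim keeps the expensive verifiers off claims they cannot check, which is the cost argument behind hybrid pipelines.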

The verification gap won't be closed by a single solution. It will be narrowed by thoughtful combination of independent verification methods, matched to the types of claims your application makes.

Key Takeaways¶

Current safety layers don't verify truth — they verify compliance, safety, and style
LLM-based verification shares LLM vulnerabilities — it's not independent
Fully independent verification exists but has narrow coverage (formal logic, knowledge graphs)
No single approach covers everything — hybrid pipelines are necessary
Match verification to claim type — factual claims need different verification than policy compliance

The question isn't "which verification method should we use?" It's "which verification methods cover the claims our application makes?"


AI Runtime Behaviour Security, 2026 (Jonathan Gill).