Skip to content

The Verification Gap

Why current AI safety approaches can't confirm ground truth — and what's emerging to fill the gap.


The Problem

Every mainstream approach to AI safety relies on one of three methods:

Method How It Works Fundamental Flaw
Ask the model Constitutional AI, Model Spec, system prompts Model can still violate instructions
Pattern matching Guardrails, keyword filters, regex Novel attacks bypass known patterns
Ask another model LLM-as-judge, evaluation frameworks Second LLM has same vulnerabilities

None of these independently verify whether an output is actually true.

They check compliance, not correctness. They detect known-bad patterns, not unknown-bad content. They evaluate style and safety, not factual accuracy.

This is the verification gap.


Why This Matters

Consider what each layer actually does:

User Input → [Guardrails] → LLM → [Judge] → Output
                  ↓              ↓
            "Does this         "Is this
             match bad          response
             patterns?"         appropriate?"
                  ↓              ↓
              Neither asks: "Is this actually true?"

A confident hallucination passes every check: - ✅ No harmful content detected - ✅ Follows system prompt instructions
- ✅ Judge rates it as helpful and appropriate - ❌ Contains fabricated facts


The Verification Spectrum

Not all verification is equal. Approaches vary in their independence from the LLM being verified:

Verification Spectrum

Fully Dependent (No Independent Verification)

  • Self-assessment: Model evaluates its own output
  • LLM-as-judge: Another LLM evaluates output
  • Risk: Same failure modes, correlated errors

Partially Independent

  • Self-consistency: Multiple samples, check agreement
  • Retrieval grounding: RAG with trusted sources
  • Risk: Confident errors still pass; retrieval can fail

Fully Independent

  • Formal verification: Mathematical proof against rules
  • Knowledge graph lookup: Structured fact verification
  • External API calls: Database/calculator validation
  • Token-level uncertainty: Model's internal confidence signals
  • Risk: Limited coverage; not all claims are verifiable

Current Solutions and Their Limits

Formal Verification (AWS Automated Reasoning)

What it does: Translates policy documents into formal logic, mathematically verifies LLM outputs against rules.

Claimed accuracy: Up to 99% for policy compliance verification.

Limitations: - Requires well-structured source documents - Complex/contradictory rules don't extract cleanly - Only verifies against your rules, not general truth - Adds latency to every request - Documents limited to 50,000 characters

Best for: Regulated industries with clear, documented policies (HR, finance, compliance).

Risk rating: ⬤⬤⬤⬤○ (High effectiveness for scoped domains)


Knowledge Graph Grounding

What it does: Maps claims to structured knowledge, verifies relationships exist in graph.

How it works: Extract subject-predicate-object from claim → Query knowledge graph → Verify relationship exists.

Limitations: - Coverage limited to what's in the graph - Struggles with temporal facts (things that change) - Entity disambiguation is hard - Complex reasoning chains degrade accuracy - Building/maintaining KG is expensive

Best for: Domain-specific factual verification (medical, legal, scientific).

Risk rating: ⬤⬤⬤○○ (Good for covered domains, poor for general knowledge)


Token-Level Detection (HaluGate, etc.)

What it does: Analyzes model internals (attention patterns, hidden states, token probabilities) to detect uncertainty.

Claimed accuracy: ~96% for determining if fact-checking is needed; varies for actual detection.

Limitations: - Requires white-box access to model - Needs context to verify against (RAG scenario) - Pre-classification only identifies what to check, not whether it's true - Novel model architectures may not transfer

Best for: Production RAG systems where you have source documents.

Risk rating: ⬤⬤⬤○○ (Fast and cheap, but context-dependent)


Self-Consistency Checking

What it does: Generate multiple responses, check if they agree.

How it works: Ask the same question 5 times with temperature > 0 → Measure semantic similarity → Flag inconsistent answers.

Limitations: - Confident hallucinations are consistently wrong - 5x latency and cost - Agreement ≠ correctness - Doesn't work for deterministic (temp=0) deployments

Best for: Catching uncertain responses, not verifying facts.

Risk rating: ⬤⬤○○○ (Catches some errors, misses confident ones)


LLM-as-Judge

What it does: Uses a second (often larger) model to evaluate outputs.

Claimed accuracy: 70-90% depending on task and judge model.

Limitations: - Same fundamental vulnerabilities as primary LLM - Can be fooled by same adversarial techniques - Adds significant latency and cost - Judge may have different biases, not fewer biases - "Grading your own homework with a different pencil"

Best for: Style, safety, and appropriateness — not factual accuracy.

Risk rating: ⬤⬤○○○ (Useful layer, but not independent verification)


Effectiveness Matrix

Approach Independence Accuracy Latency Cost Coverage
Formal Verification ⬤⬤⬤⬤⬤ ~99% +200-500ms $$ Narrow (policy)
Knowledge Graph ⬤⬤⬤⬤⬤ ~92% +50-100ms $$$ (build) Narrow (KG)
Token Detection ⬤⬤⬤⬤○ ~96% +80-160ms $ RAG only
Self-Consistency ⬤⬤⬤○○ ~80% +5x 5x All
LLM-as-Judge ⬤○○○○ ~80% +500ms-5s $$ All
Pattern Guardrails ⬤⬤⬤○○ ~85% +10-50ms $ Known patterns

Legend: Independence = how much verification relies on LLM reasoning (5 = fully independent)


Implications for System Design

The Uncomfortable Truth

No single verification approach covers everything: - Formal verification only works for documented rules - Knowledge graphs only cover what's in them - Token detection only works with RAG context - Self-consistency misses confident errors - LLM-as-judge shares LLM vulnerabilities

Defense in Depth Isn't Enough

Stacking approaches with the same blind spots doesn't help:

Bad: Guardrails → LLM → LLM-Judge → LLM-Reviewer
     (All rely on LLM reasoning — correlated failures)

Better: Guardrails → LLM → Formal Verify → KG Check → Human Sample
        (Multiple independent verification methods)

Practical Recommendations

For Tier 4 (Critical) applications: - Require at least one fully independent verification method - Formal verification for rule compliance - Knowledge graph or external API for factual claims - Human review for edge cases

For Tier 3 (Significant) applications: - Token-level detection for RAG scenarios - Self-consistency for open-ended generation - Sampling-based human review

For Tier 2 (Moderate) applications: - LLM-as-judge is acceptable for most checks - Consider formal verification for high-stakes subsets

For Tier 1 (Minimal) applications: - Standard guardrails sufficient - LLM-as-judge for quality monitoring


The Path Forward

The industry is converging on a realization: verification requires multiple independent signals.

Emerging patterns include:

  1. Hybrid pipelines: Combine cheap/fast checks (guardrails) with expensive/accurate checks (formal verification) based on claim type

  2. Claim decomposition: Break responses into atomic claims, route each to appropriate verifier

  3. Confidence-based escalation: Use model uncertainty to trigger deeper verification

  4. Domain-specific verification: Build narrow, high-accuracy verifiers for specific claim types rather than trying to verify everything

The verification gap won't be closed by a single solution. It will be narrowed by thoughtful combination of independent verification methods, matched to the types of claims your application makes.


Key Takeaways

  1. Current safety layers don't verify truth — they verify compliance, safety, and style
  2. LLM-based verification shares LLM vulnerabilities — it's not independent
  3. Fully independent verification exists but has narrow coverage (formal logic, knowledge graphs)
  4. No single approach covers everything — hybrid pipelines are necessary
  5. Match verification to claim type — factual claims need different verification than policy compliance

The question isn't "which verification method should we use?" It's "which verification methods cover the claims our application makes?"



AI Runtime Behaviour Security, 2026 (Jonathan Gill).