
5. Verification & Evidence


After this module you will be able to

  • Define what constitutes evidence that a MASO control is working
  • Design verification patterns for each layer of the three-layer architecture
  • Distinguish between compliance evidence and operational evidence
  • Identify when a control is present but not effective

Controls aren't enough; you need evidence

Deploying a control is one step. Knowing it's working is another. The Phantom Compliance incident is a case study in controls that existed but didn't verify the right thing.

Meridian Capital had:

  • Guardrails ✓
  • Logging ✓
  • Output quality checks ✓
  • Security review ✓

They could demonstrate to an auditor that controls were deployed. What they couldn't demonstrate was that the controls were effective against the actual threat. That's the difference between compliance evidence and operational evidence.


The evidence hierarchy

  • Deployment evidence: proves the control exists. Example: "We have guardrails on Agent B"
  • Activity evidence: proves the control ran. Example: "Guardrails processed 4,200 requests today"
  • Effectiveness evidence: proves the control caught something. Example: "Guardrails blocked 12 requests with incomplete retrieval metadata"
  • Coverage evidence: proves the control covers the threat model. Example: "Retrieval completeness is checked for all data sources Agent B accesses, including sanctions lists, restricted securities, and concentration data"

Most organisations stop at activity evidence. You need coverage evidence.
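Because the levels are ordered, an audit can flag any control whose strongest evidence sits below coverage. A minimal sketch in Python; the control inventory and names here are hypothetical:

```python
from enum import IntEnum

class EvidenceLevel(IntEnum):
    DEPLOYMENT = 1     # the control exists
    ACTIVITY = 2       # the control ran
    EFFECTIVENESS = 3  # the control caught something
    COVERAGE = 4       # the control covers the threat model

def controls_below_coverage(evidence):
    """Controls whose strongest evidence falls short of coverage evidence."""
    return [name for name, level in evidence.items()
            if level < EvidenceLevel.COVERAGE]

# Hypothetical control inventory for a three-layer architecture.
evidence = {
    "guardrails": EvidenceLevel.ACTIVITY,
    "judge": EvidenceLevel.EFFECTIVENESS,
    "human_review": EvidenceLevel.COVERAGE,
}
print(controls_below_coverage(evidence))  # -> ['guardrails', 'judge']
```

Using an ordered enum makes "most organisations stop at activity evidence" a query you can run, not just a slogan.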


Verification patterns by layer

Layer 1: Guardrail verification

Guardrails are fast (~10ms) and deterministic. Verification is about proving they cover the right inputs:

Pattern: Canary testing

Inject known-bad inputs on a schedule, including inputs that mimic Phantom Compliance conditions:

  • Deliberately truncated retrieval results
  • Stale cached data with current timestamps
  • Well-formed but incomplete API responses

If the guardrails don't catch the canary, you have a gap.

Evidence artefact: Canary test results with pass/fail by threat category, run frequency, and gap identification.
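A canary run can be as simple as a scheduled loop over known-bad payloads. The sketch below is illustrative only: `toy_guardrail`, the payload fields, and the canary categories are all hypothetical stand-ins for your deployed checks. Note that the toy guardrail deliberately lacks staleness detection, so the stale-cache canary surfaces as a gap:

```python
import datetime

# Hypothetical canaries mirroring Phantom Compliance conditions.
CANARIES = [
    {"category": "truncated_retrieval",
     "payload": {"results": ["doc_1"], "expected_count": 3}},
    {"category": "stale_cache",
     "payload": {"as_of": "2024-01-01", "served_at": "today"}},
    {"category": "incomplete_api_response",
     "payload": {"status": "ok", "fields": {"sanctions_check": None}}},
]

def run_canaries(guardrail, canaries=CANARIES):
    """Feed each canary through the guardrail and record whether it was caught."""
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    return [{"category": c["category"],
             "caught": guardrail(c["payload"]),
             "run_at": now} for c in canaries]

def gaps(results):
    """Canaries the guardrail let through: each one is a coverage gap."""
    return [r["category"] for r in results if not r["caught"]]

def toy_guardrail(payload):
    """Illustrative guardrail: checks counts and null fields, but not staleness."""
    if payload.get("expected_count", 0) > len(payload.get("results", [])):
        return True
    if any(v is None for v in payload.get("fields", {}).values()):
        return True
    return False

print(gaps(run_canaries(toy_guardrail)))  # -> ['stale_cache']
```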

Layer 2: Model-as-Judge verification

The Judge layer evaluates outputs that guardrails can't catch: subtle reasoning failures, unwarranted confidence, inter-agent inconsistency.

Pattern: Judge calibration

Regularly test the Judge with labelled examples:

  • Outputs from complete data (should pass)
  • Outputs from incomplete data that look identical (should fail)
  • Outputs with inflated confidence markers (should flag)

Track the Judge's true positive and false positive rates. If the Judge can't distinguish complete from incomplete reasoning, it needs recalibration.

Evidence artefact: Judge calibration report with sensitivity/specificity by failure mode.
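Sensitivity and specificity fall out of a simple confusion count over the labelled set. A hedged sketch with a deliberately naive `toy_judge` and made-up examples; the third example, where metadata is missing entirely, shows how a judge gets fooled:

```python
def calibrate(judge, labelled):
    """Confusion counts for a Judge over (output, should_fail) pairs."""
    tp = fn = tn = fp = 0
    for output, should_fail in labelled:
        flagged = judge(output)
        if should_fail:
            tp, fn = tp + flagged, fn + (not flagged)
        else:
            fp, tn = fp + flagged, tn + (not flagged)
    return {"sensitivity": tp / (tp + fn) if tp + fn else None,
            "specificity": tn / (tn + fp) if tn + fp else None}

def toy_judge(output):
    # Flags only when metadata explicitly says retrieval was incomplete.
    return output.get("retrieval_complete") is False

examples = [
    ({"text": "Trade approved.", "retrieval_complete": True}, False),
    ({"text": "Trade approved.", "retrieval_complete": False}, True),
    ({"text": "Trade approved."}, True),  # metadata missing: judge is fooled
]
print(calibrate(toy_judge, examples))
# -> {'sensitivity': 0.5, 'specificity': 1.0}
```

A sensitivity of 0.5 against identical-looking outputs is exactly the recalibration signal the pattern is meant to produce.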

When the Judge can be fooled

The AIRS framework includes a dedicated section on when the Judge can be fooled. Key risk: if the Judge evaluates the text of Agent B's output without access to Agent B's retrieval metadata, it will be fooled by the same well-formatted output that fooled Agent C. The Judge must have access to reasoning-basis metadata, not just output text.
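The difference is easy to demonstrate with two toy judges: one that sees only the prose, and one that also sees the retrieval metadata. All names and fields here are illustrative:

```python
def text_only_judge(output):
    """Sees only the prose; a well-formatted output always looks fine."""
    return "incomplete" in output["text"].lower()

def metadata_judge(output):
    """Also sees the reasoning basis behind the prose."""
    meta = output["retrieval_meta"]
    return meta["returned"] < meta["expected"] or text_only_judge(output)

# The Phantom Compliance shape: fluent text, hollow retrieval.
fooled = {"text": "Trade approved. All compliance checks passed.",
          "retrieval_meta": {"returned": 1, "expected": 3}}

print(text_only_judge(fooled))  # False: the text alone gives nothing away
print(metadata_judge(fooled))   # True: the retrieval shortfall is visible
```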

Layer 3: Human oversight verification

Human oversight is the last line of defence. Verification is about proving humans have the information they need to actually catch failures:

Pattern: Decision audit

Review a sample of human oversight decisions:

  • Did the human reviewer have access to retrieval completeness data?
  • When retrieval was incomplete, did the reviewer catch it?
  • Did the reviewer spend enough time to check the reasoning basis, or only enough to skim the output?

Evidence artefact: Human review audit with time-to-decision, information availability, and catch rate for known failure conditions.
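A decision audit reduces to a few rates over a review sample. A sketch under assumed field names (`retrieval_complete`, `seconds_spent`, and the 60-second skim threshold are all hypothetical):

```python
def audit_reviews(reviews, min_review_seconds=60):
    """Summarise a sample of human oversight decisions."""
    n = len(reviews)
    incomplete = [r for r in reviews if not r["retrieval_complete"]]
    caught = [r for r in incomplete if r["reviewer_flagged"]]
    return {
        "sample_size": n,
        "catch_rate_on_incomplete":
            len(caught) / len(incomplete) if incomplete else None,
        "had_completeness_data":
            sum(r["completeness_data_shown"] for r in reviews) / n,
        "skim_rate":
            sum(r["seconds_spent"] < min_review_seconds for r in reviews) / n,
    }

reviews = [  # illustrative sample
    {"retrieval_complete": True,  "reviewer_flagged": False,
     "seconds_spent": 120, "completeness_data_shown": True},
    {"retrieval_complete": False, "reviewer_flagged": True,
     "seconds_spent": 300, "completeness_data_shown": True},
    {"retrieval_complete": False, "reviewer_flagged": False,
     "seconds_spent": 20,  "completeness_data_shown": False},
    {"retrieval_complete": True,  "reviewer_flagged": False,
     "seconds_spent": 45,  "completeness_data_shown": True},
]
report = audit_reviews(reviews)
# catch_rate_on_incomplete 0.5, had_completeness_data 0.75, skim_rate 0.5
```

A catch rate of 0.5 on incomplete retrievals, paired with a reviewer who never saw the completeness data, points at an information gap rather than a diligence gap.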


Chain-level verification

Beyond per-layer verification, you need evidence that the chain as a whole maintains integrity:

The chain integrity test

Define a set of end-to-end test cases that exercise the entire chain with known-good and known-bad inputs:

  1. Complete data, correct decision (baseline): the chain should approve
  2. Incomplete data, plausible output (the Phantom Compliance case): the chain should escalate or reject
  3. Complete data, edge case (a borderline trade that requires nuanced judgement): the chain should escalate for human review
  4. Stale data, correct format (data that was valid yesterday but isn't today): the chain should detect the staleness

Run these tests on a schedule, not just once at deployment. Agent behaviour can drift as models are updated, data sources change, or context patterns shift.

Evidence artefact: Chain integrity test results, run weekly, with trend analysis on catch rates.
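The four cases above can be encoded as a small scheduled suite. This sketch assumes a `chain` callable returning "approve", "escalate", or "reject"; the toy chain deliberately misses the staleness case to show how a failure surfaces:

```python
CHAIN_CASES = [  # (name, inputs, expected disposition)
    ("complete_data_baseline",
     {"complete": True, "stale": False, "edge": False}, "approve"),
    ("incomplete_plausible_output",
     {"complete": False, "stale": False, "edge": False}, "escalate"),
    ("complete_data_edge_case",
     {"complete": True, "stale": False, "edge": True}, "escalate"),
    ("stale_data_correct_format",
     {"complete": True, "stale": True, "edge": False}, "escalate"),
]

def run_chain_suite(chain):
    """chain(inputs) -> 'approve' | 'escalate' | 'reject'. Returns failures."""
    failures = []
    for name, inputs, expected in CHAIN_CASES:
        actual = chain(inputs)
        # Rejecting where escalation was expected is still a safe outcome.
        ok = actual == expected or (expected == "escalate" and actual == "reject")
        if not ok:
            failures.append((name, expected, actual))
    return failures

def toy_chain(inputs):
    """Illustrative chain: handles completeness and edge cases, not staleness."""
    if not inputs["complete"] or inputs["edge"]:
        return "escalate"
    return "approve"

print(run_chain_suite(toy_chain))
# -> [('stale_data_correct_format', 'escalate', 'approve')]
```

Trending the failure list week over week is what turns these runs into the chain-level evidence artefact.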


The verification principle: A control is only as good as your evidence that it works against your actual threat model. Deployment evidence proves you spent the money. Coverage evidence proves it was worth spending.


Reflection

If you were asked to demonstrate that your AI pipeline's controls would catch a Phantom Compliance-style failure, what evidence would you present? Could you show coverage evidence, not just deployment evidence?


Next: Decision Exercise →