# Cost and Latency
The three-layer pattern is not free. Budget for it.
## The Problem
Each layer adds cost and latency:
| Layer | Latency Added | Cost Per Request | At 1M requests/month |
|---|---|---|---|
| Guardrails (rule-based) | 5–20ms | ~$0 (compute only) | Negligible |
| Guardrails (ML classifier) | 20–100ms | $0.001–0.005 | $1K–5K |
| Judge (LLM evaluation) | 500ms–5s | $0.01–0.05 | $10K–50K |
| Human Oversight (per review) | Minutes–hours | $5–50 per review | Depends on sample rate |
For a Tier 3 system running the full pattern on every request, the Judge alone can cost more than the generator.
## Sampling Strategies
You don't have to judge every request. Match evaluation density to risk.
### By Risk Tier
| Risk Tier | Guardrails | Judge | Human Review |
|---|---|---|---|
| Tier 1 (Low) | 100% of requests | 5–10% sample | 1% or anomaly-triggered |
| Tier 2 (Medium) | 100% of requests | 25–50% sample | 5% + all judge flags |
| Tier 3 (High) | 100% of requests | 100% of requests | 10% + all judge flags |
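As a minimal sketch of how the tier table might be encoded in routing code (the dictionary and function names are illustrative, and the rates are midpoints of the ranges above — tune them to your own risk appetite):

```python
import random

# Judge / human-review sampling rates per risk tier, taken from the
# table above (midpoints where the table gives a range).
TIER_POLICY = {
    1: {"judge_rate": 0.10, "human_rate": 0.01},
    2: {"judge_rate": 0.50, "human_rate": 0.05},
    3: {"judge_rate": 1.00, "human_rate": 0.10},
}

def should_judge(tier: int, rng=random) -> bool:
    """Guardrails always run at 100%; the judge runs at the tier's rate."""
    return rng.random() < TIER_POLICY[tier]["judge_rate"]
```

Note that Tier 3 degenerates to `True` on every request, which is exactly the table's intent: full evaluation for high-risk systems.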
### Adaptive Sampling
Increase judge evaluation rate when signals indicate elevated risk:
| Trigger | Sampling Adjustment |
|---|---|
| Guardrail block rate above baseline | Increase judge rate by 2x |
| New user (first 50 requests) | Judge 100% |
| After-hours usage (if unusual for your environment) | Increase judge rate by 2x |
| Prompt attack detected | Judge 100% for that user for 24 hours |
| Model provider change notification | Judge 100% for 48 hours |
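One way to combine these triggers (a sketch, assuming the 100%-judge triggers always win and the 2x multipliers stack, capped at full sampling; the function signature is an assumption, not a prescribed API):

```python
def effective_judge_rate(base_rate: float, *,
                         block_rate_elevated: bool = False,
                         new_user: bool = False,
                         after_hours: bool = False,
                         attack_detected: bool = False,
                         provider_changed: bool = False) -> float:
    """Apply the adaptive-sampling triggers from the table above."""
    # Any trigger that mandates 100% judging short-circuits the rest.
    if new_user or attack_detected or provider_changed:
        return 1.0
    rate = base_rate
    if block_rate_elevated:   # guardrail block rate above baseline
        rate *= 2
    if after_hours:           # unusual-hours usage
        rate *= 2
    return min(rate, 1.0)     # sampling rate cannot exceed 100%
```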
### Stratified Sampling
Not all requests carry equal risk. Sample by category:
| Request Type | Judge Rate | Rationale |
|---|---|---|
| FAQ / simple lookup | 5% | Low risk, repetitive |
| Creative generation | 25% | More variable, higher guardrail miss rate |
| Data analysis / summarisation | 50% | Accesses user data, exfiltration risk |
| Decision support | 100% | Consequential output |
| Actions / tool use | 100% | Real-world impact |
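Stratified rates also make cost forecasting straightforward: expected judge spend is just volume × rate × cost per evaluation, summed over categories. A sketch (category keys and the flat per-evaluation cost are illustrative):

```python
# Judge sampling rate per request category (from the table above).
JUDGE_RATE_BY_CATEGORY = {
    "faq": 0.05,
    "creative": 0.25,
    "analysis": 0.50,
    "decision_support": 1.00,
    "tool_use": 1.00,
}

def monthly_judge_cost(volume_by_category: dict, cost_per_eval: float) -> float:
    """Expected monthly judge spend given per-category request volumes
    and a flat cost per evaluation (e.g. $0.01 for a medium judge)."""
    return sum(volume * JUDGE_RATE_BY_CATEGORY[cat] * cost_per_eval
               for cat, volume in volume_by_category.items())
```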
## Latency Budgets
Design your latency budget before adding controls.
### Example: Customer-Facing Chat (Tier 2, Streaming)
| Component | Budget | Actual |
|---|---|---|
| Input guardrails | 20ms | 15ms (rule-based) |
| LLM generation (first token) | 500ms | 400ms |
| Buffer evaluation (per chunk) | 50ms | 30ms (rule-based) |
| Total to first visible token | 570ms | 445ms |
| Post-stream judge evaluation | N/A (async) | 2s |
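The design point in this example is that the judge runs off the critical path: the ~2s evaluation happens after the stream completes, so it never delays the visible response. A minimal sketch of that pattern using a worker pool (the function names and the stub judge are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

_judge_pool = ThreadPoolExecutor(max_workers=2)

def judge_transcript(transcript: str) -> str:
    """Stand-in for the ~2s judge LLM call; returns a verdict."""
    return "pass"

def handle_chat(stream_chunks, emit):
    """Emit chunks to the user as they arrive, then submit the full
    transcript to the judge pool so evaluation never blocks delivery."""
    transcript = []
    for chunk in stream_chunks:   # per-chunk buffer guardrails already applied
        transcript.append(chunk)
        emit(chunk)               # user sees tokens immediately
    # Off the critical path: judged asynchronously after the stream ends.
    return _judge_pool.submit(judge_transcript, "".join(transcript))
```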
### Example: Document Processing (Tier 3, Non-Streaming)
| Component | Budget | Actual |
|---|---|---|
| Input guardrails | 100ms | 50ms |
| LLM generation (complete) | 10s | 8s |
| Output guardrails | 100ms | 60ms |
| Judge evaluation | 5s | 3s |
| Total before delivery | 15.2s | 11.1s |
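In the non-streaming case every stage is synchronous, so the budget is a simple sum and any stage overrun eats directly into total delivery time. A small budget-check helper (stage names and budget values mirror the table above; the function itself is illustrative):

```python
# Per-stage latency budgets in milliseconds, from the Tier 3 table above.
BUDGET_MS = {
    "input_guardrails": 100,
    "generation": 10_000,
    "output_guardrails": 100,
    "judge": 5_000,
}

def check_latency(actuals_ms: dict):
    """Return (total_ms, overruns) for measured per-stage latencies.
    overruns maps each stage that exceeded its budget to its actual value."""
    overruns = {stage: ms for stage, ms in actuals_ms.items()
                if ms > BUDGET_MS[stage]}
    return sum(actuals_ms.values()), overruns
```

With the "Actual" column above (50, 8000, 60, 3000 ms) this totals 11,110 ms — the 11.1 s in the table — with no overruns.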
### What Breaks the Budget
| Problem | Cause | Mitigation |
|---|---|---|
| Judge adds 5s to every request | Using large model for judge | Use smaller model (Haiku-class) for routine evaluation |
| Guardrail latency spikes | ML classifier cold start | Pre-warm classifiers, use rule-based for latency-critical path |
| Multiple judge calls per request | Evaluating multiple dimensions separately | Batch evaluations into a single prompt |
| Human review blocks delivery | Synchronous human review on all flags | Async review for medium flags; synchronous only for high/critical |
## Cost Optimisation

### Judge Model Selection
| Judge Model Tier | Cost (per 1K eval tokens) | Accuracy | When to Use |
|---|---|---|---|
| Small (Haiku, GPT-4o-mini) | ~$0.001 | 80–85% | Tier 1, high-volume screening |
| Medium (Sonnet, GPT-4o) | ~$0.01 | 88–93% | Tier 2, balanced cost/accuracy |
| Large (Opus, GPT-4) | ~$0.05 | 93–97% | Tier 3, consequential decisions |
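A simple mapping from application risk tier to judge model tier, encoding the table above (model names are illustrative; costs are per 1K evaluation tokens and accuracies are the midpoints of the ranges shown):

```python
# Representative judge tiers from the table above.
JUDGE_MODELS = [
    {"tier": "small",  "cost_per_1k": 0.001, "accuracy": 0.82},
    {"tier": "medium", "cost_per_1k": 0.01,  "accuracy": 0.90},
    {"tier": "large",  "cost_per_1k": 0.05,  "accuracy": 0.95},
]

def pick_judge(risk_tier: int) -> dict:
    """Map application risk tier (1-3) to a judge model tier:
    Tier 1 -> small, Tier 2 -> medium, Tier 3 -> large."""
    return JUDGE_MODELS[risk_tier - 1]
```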
### Tiered Evaluation
Run cheap evaluation first; escalate to expensive evaluation only when needed:
```text
Request → Rule-based guardrails (free, fast)
  ↓ (passed)
Small model judge (cheap, fast)
  ↓ (flagged or uncertain)
Large model judge (expensive, accurate)
  ↓ (flagged)
Human review (most expensive)
```
This reduces cost by 60–80% compared to running the large model on everything.
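The escalation logic can be sketched as follows (the verdict strings and callable interfaces are assumptions for illustration — each layer returns `'pass'`, `'flag'`, or `'uncertain'`):

```python
def tiered_evaluate(request, rules, small_judge, large_judge) -> str:
    """Escalate only when the cheaper layer flags or is uncertain."""
    if rules(request) == "flag":
        return "blocked_by_rules"        # free layer caught it
    verdict = small_judge(request)
    if verdict == "pass":
        return "pass"                    # most traffic stops here
    # Flagged or uncertain: escalate to the expensive judge.
    if large_judge(request) == "flag":
        return "human_review"            # confirmed flag goes to a human
    return "pass"                        # large judge overruled the small one
```

To see where the saving comes from, take the illustrative per-evaluation costs above: if the small judge ($0.001) escalates 25% of requests to the large judge ($0.05), the blended cost is $0.001 + 0.25 × $0.05 ≈ $0.0135 per request, roughly a 73% reduction versus running the large judge on everything — consistent with the 60–80% range.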
### Caching
Judge evaluations on identical or near-identical inputs can be cached:
| Cache Type | Hit Rate | Risk |
|---|---|---|
| Exact match (same input hash) | Low (5–10%) | None |
| Semantic similarity (embedding distance < threshold) | Medium (15–30%) | Adversarial inputs designed to be semantically similar but functionally different |
Only cache for Tier 1. For Tier 2–3, the risk of cache-based bypass outweighs the cost saving.
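An exact-match cache (the safe variant) is a few lines — verdicts are keyed by a hash of the verbatim input, so only byte-identical requests hit the cache. A sketch, with a hypothetical `judge` callable:

```python
import hashlib

class ExactMatchJudgeCache:
    """Cache judge verdicts keyed by a hash of the exact input text.
    Safe for Tier 1 only: semantic-similarity caches can be bypassed
    by adversarial near-duplicates, per the table above."""

    def __init__(self):
        self._store = {}

    def _key(self, text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_evaluate(self, text: str, judge):
        key = self._key(text)
        if key not in self._store:
            self._store[key] = judge(text)   # cache miss: pay for the eval
        return self._store[key]
```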
## Budgeting Template
| Line Item | Monthly Estimate |
|---|---|
| Generator LLM API costs | $ ___ |
| Input guardrails (if ML-based) | $ ___ |
| Output guardrails (if ML-based) | $ ___ |
| Judge LLM API costs (at sampling rate ___%) | $ ___ |
| Human review (estimated ___ reviews × $___/review) | $ ___ |
| Monitoring infrastructure (SIEM, dashboards) | $ ___ |
| Total security overhead | $ ___ |
| As % of generator cost | ____% |
**Rule of thumb:** Security overhead is typically 15–40% of generator cost for Tier 2, and 40–100% for Tier 3.
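The template reduces to one sum and one ratio; a sketch for filling it in programmatically (parameter names follow the line items above and are otherwise arbitrary):

```python
def security_overhead(generator_cost: float, guardrail_cost: float,
                      judge_cost: float, human_reviews: int,
                      cost_per_review: float, monitoring_cost: float):
    """Total monthly security overhead and its ratio to generator spend."""
    overhead = (guardrail_cost + judge_cost
                + human_reviews * cost_per_review + monitoring_cost)
    return overhead, overhead / generator_cost
```

For example, $10K of generator spend with $500 of guardrails, $2K of judge calls, 100 reviews at $10 each, and $500 of monitoring gives $4K of overhead — 40% of generator cost, at the top of the Tier 2 range.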
AI Runtime Behaviour Security, 2026 (Jonathan Gill).