Cost and Latency

The three-layer pattern is not free. Budget for it.

The Problem

Each layer adds cost and latency:

| Layer | Latency Added | Cost Per Request | At 1M Requests/Month |
|---|---|---|---|
| Guardrails (rule-based) | 5–20ms | ~$0 (compute only) | Negligible |
| Guardrails (ML classifier) | 20–100ms | $0.001–0.005 | $1K–5K |
| Judge (LLM evaluation) | 500ms–5s | $0.01–0.05 | $10K–50K |
| Human Oversight (per review) | Minutes–hours | $5–50 per review | Depends on sample rate |

For a Tier 3 system running the full pattern on every request, the Judge alone can cost more than the generator.
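
The monthly column in the table is plain multiplication, so it is easy to redo with your own volume and sampling rate. The sketch below assumes the per-request judge figures from the table; `monthly_cost` and its parameters are illustrative names, not part of any library.

```python
def monthly_cost(requests_per_month: int,
                 cost_per_request: float,
                 sample_rate: float = 1.0) -> float:
    """Monthly spend for one layer that evaluates a fraction of requests."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    return requests_per_month * sample_rate * cost_per_request


REQUESTS = 1_000_000  # volume used in the table above

# Judge on every request, low and high ends of the per-request range.
judge_low = monthly_cost(REQUESTS, 0.01)
judge_high = monthly_cost(REQUESTS, 0.05)

# Same judge, but only evaluating a 10% sample.
judge_sampled = monthly_cost(REQUESTS, 0.05, sample_rate=0.10)

print(f"Judge at 100%: ${judge_low:,.0f} to ${judge_high:,.0f}")
print(f"Judge at 10% (high end): ${judge_sampled:,.0f}")
```

The same function works for the ML classifier row, or for human review if you treat the review rate as the sample rate.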


Sampling Strategies

You don't have to judge every request. Match evaluation density to risk.

By Risk Tier

| Risk Tier | Guardrails | Judge | Human Review |
|---|---|---|---|
| Tier 1 (Low) | 100% of requests | 5–10% sample | 1% or anomaly-triggered |
| Tier 2 (Medium) | 100% of requests | 25–50% sample | 5% + all judge flags |
| Tier 3 (High) | 100% of requests | 100% of requests | 10% + all judge flags |
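
One way to apply the judge rates above is deterministic, hash-based sampling, so the same request always gets the same decision and the sample can be reproduced in an audit. This is a minimal sketch, not a prescribed implementation: the rates use the upper end of each range in the table, and `should_judge` and `JUDGE_RATE_BY_TIER` are hypothetical names. It also assumes request IDs are generated server-side, since a caller who controls the ID could otherwise steer around the sample.

```python
import hashlib

# Assumption: upper end of each judge range from the table above.
JUDGE_RATE_BY_TIER = {1: 0.10, 2: 0.50, 3: 1.00}


def should_judge(request_id: str, tier: int) -> bool:
    """Return True if this request falls inside the tier's judge sample."""
    rate = JUDGE_RATE_BY_TIER[tier]
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a value in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


if __name__ == "__main__":
    sampled = sum(should_judge(f"req-{i}", tier=1) for i in range(10_000))
    print(f"Tier 1: judged {sampled} of 10000 requests")
```

Judge flags still go to human review regardless of the sample; this function only decides whether the judge runs.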

Adaptive Sampling

Increase judge evaluation rate when signals indicate elevated risk:

| Trigger | Sampling Adjustment |
|---|---|
| Guardrail block rate above baseline | Increase judge rate by 2x |
| New user (first 50 requests) | Judge 100% |
| After-hours usage (if unusual for your environment) | Increase judge rate by 2x |
| Prompt attack detected | Judge 100% for that user for 24 hours |
| Model provider change notification | Judge 100% for 48 hours |
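
The triggers above can be folded into a single rate calculation. The sketch below is one possible reading of the table: the `RiskSignals` fields are hypothetical, and it assumes the two "2x" triggers stack (giving 4x) with the result capped at 100%. The source does not say how the multipliers combine, so adjust that to your own policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RiskSignals:
    """Hypothetical per-request signals; field names are illustrative."""
    block_rate_above_baseline: bool = False
    prior_requests_from_user: int = 1_000
    unusual_after_hours: bool = False
    hours_since_prompt_attack: Optional[float] = None    # for this user
    hours_since_provider_change: Optional[float] = None


def adaptive_judge_rate(base_rate: float, signals: RiskSignals) -> float:
    """Return the judge sampling rate after applying the triggers."""
    attack = signals.hours_since_prompt_attack
    change = signals.hours_since_provider_change
    force_full = (
        signals.prior_requests_from_user < 50       # first 50 requests
        or (attack is not None and attack < 24)     # 24h after an attack
        or (change is not None and change < 48)     # 48h after provider change
    )
    if force_full:
        return 1.0

    rate = base_rate
    if signals.block_rate_above_baseline:
        rate *= 2
    if signals.unusual_after_hours:
        rate *= 2  # assumption: multipliers stack
    return min(rate, 1.0)


if __name__ == "__main__":
    print(adaptive_judge_rate(0.25, RiskSignals(block_rate_above_baseline=True)))
```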

Stratified Sampling

Not all requests carry equal risk. Sample by category:

| Request Type | Judge Rate | Rationale |
|---|---|---|
| FAQ / simple lookup | 5% | Low risk, repetitive |
| Creative generation | 25% | More variable, higher guardrail miss rate |
| Data analysis / summarisation | 50% | Accesses user data, exfiltration risk |
| Decision support | 100% | Consequential output |
| Actions / tool use | 100% | Real-world impact |
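
To see what a stratified policy means for overall judge volume, weight each category's rate by its share of traffic. The sketch below uses the rates from the table; the category keys and the traffic mix are made-up examples, and unknown request types default to 100% as a fail-closed assumption.

```python
# Judge rates from the table above; the keys are illustrative labels.
JUDGE_RATE_BY_TYPE = {
    "faq": 0.05,
    "creative": 0.25,
    "analysis": 0.50,
    "decision_support": 1.00,
    "tool_use": 1.00,
}


def judge_rate_for(request_type: str) -> float:
    """Look up the judge rate; unclassified requests are always judged."""
    return JUDGE_RATE_BY_TYPE.get(request_type, 1.0)


def blended_judge_rate(traffic_mix: dict[str, float]) -> float:
    """Overall fraction of requests judged, given each type's traffic share."""
    if abs(sum(traffic_mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * judge_rate_for(kind) for kind, share in traffic_mix.items())


# Hypothetical traffic mix, for illustration only.
mix = {
    "faq": 0.60,
    "creative": 0.20,
    "analysis": 0.10,
    "decision_support": 0.05,
    "tool_use": 0.05,
}
print(f"Blended judge rate: {blended_judge_rate(mix):.0%}")
```

With this example mix, roughly 23% of requests reach the judge even though two categories are judged at 100%, because most traffic sits in the low-rate FAQ category.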

Latency Budgets

Design your latency budget before adding controls.

Example: Customer-Facing Chat (Tier 2, Streaming)

| Component | Budget | Actual |
|---|---|---|
| Input guardrails | 20ms | 15ms (rule-based) |
| LLM generation (first token) | 500ms | 400ms |
| Buffer evaluation (per chunk) | 50ms | 30ms (rule-based) |
| Total to first visible token | 570ms | 445ms |
| Post-stream judge evaluation | N/A (async) | 2s |

Example: Document Processing (Tier 3, Non-Streaming)

| Component | Budget | Actual |
|---|---|---|
| Input guardrails | 100ms | 50ms |
| LLM generation (complete) | 10s | 8s |
| Output guardrails | 100ms | 60ms |
| Judge evaluation | 5s | 3s |
| Total before delivery | 15.2s | 11.1s |
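
A budget like this is only useful if something checks measurements against it. The sketch below encodes the document-processing example as data and reports totals plus any stage that overran; `Stage` and `check_budget` are illustrative names, and in practice the actuals would come from your tracing or metrics system rather than constants.

```python
from typing import NamedTuple


class Stage(NamedTuple):
    name: str
    budget_ms: int
    actual_ms: int


# Figures from the Tier 3 document-processing example above.
PIPELINE = [
    Stage("Input guardrails", 100, 50),
    Stage("LLM generation (complete)", 10_000, 8_000),
    Stage("Output guardrails", 100, 60),
    Stage("Judge evaluation", 5_000, 3_000),
]


def check_budget(stages: list[Stage]) -> tuple[int, int, list[str]]:
    """Return (total budget, total actual, names of stages over budget)."""
    total_budget = sum(s.budget_ms for s in stages)
    total_actual = sum(s.actual_ms for s in stages)
    over = [s.name for s in stages if s.actual_ms > s.budget_ms]
    return total_budget, total_actual, over


budget, actual, over = check_budget(PIPELINE)
print(f"Budget {budget / 1000:.1f}s, actual {actual / 1000:.2f}s, over budget: {over}")
```

Checking each stage as well as the total matters: a pipeline can sit inside its overall budget while one stage has quietly used up the headroom of another.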

What Breaks the Budget

| Problem | Cause | Mitigation |
|---|---|---|
| Judge adds 5s to every request | Using large model for judge | Use smaller model (Haiku-class) for routine evaluation |
| Guardrail latency spikes | ML classifier cold start | Pre-warm classifiers, use rule-based for latency-critical path |
| Multiple judge calls per request | Evaluating multiple dimensions separately | Batch evaluations into a single prompt |
| Human review blocks delivery | Synchronous human review on all flags | Async review for medium flags; synchronous only for high/critical |
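
As an illustration of the "batch evaluations into a single prompt" mitigation, the sketch below builds one judge prompt covering several dimensions and parses a single JSON reply. It makes no model call. The dimension names, prompt wording, and the pass/fail JSON format are all assumptions for the example, and anything missing or malformed in the reply is treated as a failure.

```python
import json

# Hypothetical evaluation dimensions; use whatever your judge rubric defines.
DIMENSIONS = ["safety", "accuracy", "policy_compliance"]


def build_batched_judge_prompt(output_text: str, dimensions: list[str]) -> str:
    """Build one prompt that asks for a verdict on every dimension at once."""
    keys = ", ".join(f'"{d}"' for d in dimensions)
    return (
        "Evaluate the response below on each dimension.\n"
        f"Reply with a single JSON object with exactly these keys: {keys}.\n"
        'Each value must be "pass" or "fail".\n\n'
        f"Response:\n{output_text}"
    )


def parse_batched_verdict(raw: str, dimensions: list[str]) -> dict[str, bool]:
    """Parse the judge reply; missing or invalid entries count as failures."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {d: data.get(d) == "pass" for d in dimensions}


if __name__ == "__main__":
    print(build_batched_judge_prompt("Your refund was approved.", DIMENSIONS))
    print(parse_batched_verdict('{"safety": "pass", "accuracy": "fail"}', DIMENSIONS))
```

One call replaces three here, which cuts both latency and per-request cost. The trade-off is that a single malformed reply loses all dimensions at once, which is why the parser fails closed.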

Cost Optimisation

Judge Model Selection

| Judge Model Tier | Cost (per 1K eval tokens) | Accuracy | When to Use |
|---|---|---|---|
| Small (Haiku, GPT-4o-mini) | ~$0.001 | 80–85% | Tier 1, high-volume screening |
| Medium (Sonnet, GPT-4o) | ~$0.01 | 88–93% | Tier 2, balanced cost/accuracy |
| Large (Opus, GPT-4) | ~$0.05 | 93–97% | Tier 3, consequential decisions |

Tiered Evaluation

Run cheap evaluation first; escalate to expensive evaluation only when needed:

Request → Rule-based guardrails (free, fast)
  ↓ (passed)
Request → Small model judge (cheap, fast)
  ↓ (flagged or uncertain)
Request → Large model judge (expensive, accurate)
  ↓ (flagged)
Request → Human review (most expensive)

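This can reduce judge cost by 60–80% compared to running the large model on everything. The actual saving depends on how often requests escalate from the small judge to the large one.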
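
The escalation flow can be sketched as a short cascade with a matching cost estimate. This is an illustration under stated assumptions rather than a reference implementation: the judges are passed in as plain callables returning "pass", "flag", or "uncertain"; any large-judge result other than "pass" goes to human review; and the cost figures treat each evaluation as roughly 1K tokens so the per-1K prices from the model table can be used directly.

```python
from typing import Callable

Judge = Callable[[str], str]  # returns "pass", "flag", or "uncertain"


def tiered_evaluate(text: str,
                    passes_rules: Callable[[str], bool],
                    small_judge: Judge,
                    large_judge: Judge) -> str:
    """Run cheap checks first and escalate only when needed."""
    if not passes_rules(text):
        return "blocked"
    if small_judge(text) == "pass":
        return "delivered"
    # Small judge flagged or was uncertain: escalate to the large judge.
    if large_judge(text) == "pass":
        return "delivered"
    return "human_review"


def saving_vs_large_only(small_cost: float,
                         large_cost: float,
                         escalation_rate: float) -> float:
    """Fractional judge-cost saving compared with large model on everything."""
    expected = small_cost + escalation_rate * large_cost
    return 1.0 - expected / large_cost


if __name__ == "__main__":
    # Assumed escalation rates; measure your own before relying on these.
    for rate in (0.20, 0.30):
        print(f"escalation {rate:.0%}: saving {saving_vs_large_only(0.001, 0.05, rate):.0%}")
```

With the small and large prices from the model table, escalation rates of 20% and 30% give savings of about 78% and 68%. That is consistent with the 60–80% range, and it shows the saving shrinks as the small judge escalates more often.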

Caching

Judge evaluations on identical or near-identical inputs can be cached:

| Cache Type | Hit Rate | Risk |
|---|---|---|
| Exact match (same input hash) | Low (5–10%) | None |
| Semantic similarity (embedding distance < threshold) | Medium (15–30%) | Adversarial inputs designed to be semantically similar but functionally different |

Only cache for Tier 1. For Tier 2–3, the risk of cache-based bypass outweighs the cost saving.
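
A minimal sketch of the exact-match variant, restricted to Tier 1 as recommended above. `JudgeCache` is a hypothetical in-memory class for illustration; a production cache would also need expiry and invalidation when the judge prompt or model changes, which this example leaves out.

```python
import hashlib
from typing import Callable


class JudgeCache:
    """Exact-match cache for judge verdicts, used for Tier 1 only."""

    def __init__(self) -> None:
        self._verdicts: dict[str, str] = {}

    def evaluate(self, text: str, tier: int, judge: Callable[[str], str]) -> str:
        # Tier 2 and Tier 3 always get a fresh evaluation.
        if tier != 1:
            return judge(text)
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._verdicts:
            self._verdicts[key] = judge(text)
        return self._verdicts[key]


if __name__ == "__main__":
    calls = 0

    def demo_judge(text: str) -> str:
        """Stand-in for a real judge call; counts invocations."""
        global calls
        calls += 1
        return "pass"

    cache = JudgeCache()
    cache.evaluate("What are your opening hours?", 1, demo_judge)
    cache.evaluate("What are your opening hours?", 1, demo_judge)
    print(f"Judge calls for two identical Tier 1 requests: {calls}")
```

Hashing the full input means any change to the text, however small, misses the cache. That is why the exact-match row carries no bypass risk, and also why its hit rate is low.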


Budgeting Template

| Line Item | Monthly Estimate |
|---|---|
| Generator LLM API costs | $ ___ |
| Input guardrails (if ML-based) | $ ___ |
| Output guardrails (if ML-based) | $ ___ |
| Judge LLM API costs (at sampling rate ___%) | $ ___ |
| Human review (estimated ___ reviews × $___/review) | $ ___ |
| Monitoring infrastructure (SIEM, dashboards) | $ ___ |
| Total security overhead | $ ___ |
| As % of generator cost | ____% |

Rule of thumb: Security overhead is typically 15–40% of generator cost for Tier 2, and 40–100% for Tier 3.
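
Once the template is filled in, the last two rows and the rule-of-thumb check are simple arithmetic. The sketch below shows that calculation; every dollar figure in it is an invented placeholder, and `security_overhead` and `within_rule_of_thumb` are illustrative names.

```python
# Rule-of-thumb ranges from the text above, as percent of generator cost.
RULE_OF_THUMB_PCT = {2: (15, 40), 3: (40, 100)}


def security_overhead(generator_cost: float,
                      line_items: dict[str, float]) -> tuple[float, float]:
    """Return (total overhead in dollars, overhead as % of generator cost)."""
    if generator_cost <= 0:
        raise ValueError("generator_cost must be positive")
    total = sum(line_items.values())
    return total, 100.0 * total / generator_cost


def within_rule_of_thumb(tier: int, overhead_pct: float) -> bool:
    """Check the overhead percentage against the range for Tier 2 or 3."""
    low, high = RULE_OF_THUMB_PCT[tier]
    return low <= overhead_pct <= high


# Placeholder monthly figures for a hypothetical Tier 2 system.
items = {
    "input_guardrails": 500.0,
    "output_guardrails": 500.0,
    "judge": 3_000.0,
    "human_review": 1_500.0,
    "monitoring": 500.0,
}
total, pct = security_overhead(20_000.0, items)
print(f"Overhead ${total:,.0f} ({pct:.0f}% of generator cost), "
      f"within Tier 2 range: {within_rule_of_thumb(2, pct)}")
```

A result well outside the range is not necessarily wrong, but it is a prompt to re-check sampling rates and judge model choice.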

AI Runtime Behaviour Security, 2026 (Jonathan Gill).