Streaming Controls¶
Validating output that hasn't finished yet.
This document uses the simplified three-tier system (Tier 1/2/3). See Risk Tiers — Simplified Tier Mapping for the mapping to LOW/MEDIUM/HIGH/CRITICAL.
The Problem¶
Most LLM deployments use streaming (Server-Sent Events) to deliver tokens incrementally. Users see output as it is generated, with tokens typically arriving every 20–80 ms.
The three-layer pattern assumes you can evaluate a complete response:
- Guardrails check the full output → Can't. It's not complete.
- Judge evaluates the full output → Can't. Same reason.
- Human reviews if flagged → User already saw it.
By the time evaluation is possible, the content is already on screen.
Why This Matters¶
For Tier 1 (low risk) — internal chatbots, content drafting — this is acceptable. Users understand they're interacting with a draft.
For Tier 2–3 — customer-facing, regulated, consequential — delivering unevaluated content to users is a control gap.
Control Patterns¶
Pattern 1: Buffer and Release¶
Hold tokens in a server-side buffer. Release in chunks after evaluation.
LLM → Buffer (N tokens) → Guardrail check → Release to client → Buffer next N tokens → ...
| Parameter | Setting | Trade-off |
|---|---|---|
| Buffer size | 50–100 tokens (~1–2 sentences) | Larger = better detection, worse perceived latency |
| Evaluation | Text guardrails on buffer content | Fast rules only — no LLM judge in the loop |
| Release trigger | Guardrail pass OR buffer timeout | Timeout prevents indefinite blocking |
| Block action | Replace buffer content with safe message | User sees interruption, not harmful content |
Latency impact: 200–500ms additional delay per chunk. Acceptable for most use cases.
Limitation: Evaluation is per-chunk, not whole-response. Context-dependent violations (safe first half, harmful when combined with second half) will be missed.
Pattern 2: Stream with Post-Hoc Evaluation¶
Deliver tokens immediately. Evaluate the complete response asynchronously. Retract or flag if evaluation fails.
LLM → Client (real-time)
↓ (parallel)
LLM → Complete response → Judge evaluation
↓ (if flagged)
Client ← Retraction/warning message
| Step | Timing | What Happens |
|---|---|---|
| Streaming delivery | Real-time | User sees output immediately |
| Response completion | 2–30s after stream starts | Full response available for evaluation |
| Judge evaluation | 500ms–5s after completion | Async quality check |
| Retraction (if needed) | 3–35s after first token | Warning appended, content flagged in UI |
Limitation: User has already seen the content. Retraction limits further harm but doesn't prevent initial exposure.
When to use: Tier 1–2 where the risk of initial exposure is acceptable and the primary goal is audit trail and pattern detection, not prevention.
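A minimal server-side sketch of this pattern is shown below. The llm.stream(), judge.evaluate(), send_retraction(), and log_flagged_response() names are illustrative placeholders rather than a specific framework API; the point is that evaluation runs only after the stream completes, so it can retract but never block.
# Pseudocode: judge, send_retraction and the logging helper are assumptions;
# adapt to your stack.
import asyncio

async def stream_with_post_hoc_evaluation(prompt, response_id, request_id):
    tokens = []

    # 1. Stream to the client in real time while accumulating the full response
    async for token in llm.stream(prompt):
        tokens.append(token)
        yield token

    # 2. Evaluate the complete response asynchronously; delivery has already happened
    full_text = "".join(tokens)
    asyncio.create_task(  # fire-and-forget; keep a reference in production
        evaluate_and_retract(full_text, response_id, request_id)
    )

async def evaluate_and_retract(full_text, response_id, request_id):
    verdict = await judge.evaluate(full_text)
    if not verdict.passed:
        # 3. Retraction limits further harm and creates an audit trail,
        #    but the user has already seen the content
        await send_retraction(response_id, verdict.reason)
        log_flagged_response(full_text, verdict, request_id)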
Pattern 3: Non-Streaming for High Risk¶
Don't stream. Generate the complete response, evaluate it, then deliver.
LLM → Complete response → Guardrails → Judge → Deliver (or block)
| Trade-off | Impact |
|---|---|
| Latency | User waits 3–30s for response with no incremental feedback |
| UX | Worse perceived performance — mitigate with loading indicators |
| Safety | Full evaluation before delivery — strongest control |
When to use: Tier 3 and any Tier 2 use case where content reaches customers, regulators, or triggers consequential actions.
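A sketch of the non-streaming flow, assuming the same illustrative llm, guardrail, and judge components used elsewhere on this page (none of these are a specific library API):
# Pseudocode: component and helper names are placeholders; adapt to your stack.
async def generate_non_streaming(prompt, request_id):
    # 1. Generate the complete response before anything reaches the client
    response_text = await llm.generate(prompt)

    # 2. Fast rule-based guardrails first
    if not guardrail.check(response_text).passed:
        log_blocked_content(response_text, request_id)
        return "[Content blocked by safety controls]"

    # 3. Judge evaluation before delivery; adds latency but gives the strongest control
    verdict = await judge.evaluate(response_text)
    if not verdict.passed:
        log_flagged_response(response_text, verdict, request_id)
        return "[Response withheld pending review]"

    # 4. Deliver only after both checks pass
    return response_text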
Choosing the Right Pattern¶
| Risk Tier | Recommended Pattern | Rationale |
|---|---|---|
| Tier 1 (Internal, low risk) | Stream with post-hoc evaluation | Users are employees, risk is low, UX matters |
| Tier 2 (Business impact) | Buffer and release | Balance between safety and responsiveness |
| Tier 2 (Customer-facing) | Non-streaming OR buffer-and-release | Customer exposure requires pre-delivery evaluation |
| Tier 3 (Regulated, consequential) | Non-streaming | Full evaluation before any content is delivered |
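If the routing is made explicit in code, it can be as simple as a selector keyed on tier and audience. The tier labels and pattern identifiers below are illustrative, not a standard configuration:
# Illustrative selector mirroring the table above; labels are placeholders.
def select_streaming_pattern(tier: int, customer_facing: bool) -> str:
    if tier >= 3:
        return "non_streaming"        # full evaluation before delivery
    if tier == 2 and customer_facing:
        return "non_streaming"        # or "buffer_and_release" if latency demands it
    if tier == 2:
        return "buffer_and_release"   # balance safety and responsiveness
    return "post_hoc"                 # Tier 1: stream now, evaluate after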
Implementation Notes¶
Buffer and Release — Server-Side¶
# Pseudocode — adapt to your framework; llm, guardrail, and
# log_blocked_content are placeholders for your own components.
async def stream_with_buffer(prompt, request_id):
    buffer = []
    buffer_size = 75  # tokens (~1–2 sentences)

    async for token in llm.stream(prompt):
        buffer.append(token)
        if len(buffer) >= buffer_size:
            chunk_text = "".join(buffer)
            if guardrail.check(chunk_text).passed:
                yield chunk_text
                buffer = []
            else:
                yield "[Content filtered by safety controls]"
                # Log the blocked content for review
                log_blocked_content(chunk_text, request_id)
                return  # Stop generation

    # Flush remaining buffer once the stream ends
    if buffer:
        chunk_text = "".join(buffer)
        if guardrail.check(chunk_text).passed:
            yield chunk_text
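As a usage sketch, the generator above can back a streaming endpoint in any async framework. The example below uses FastAPI's StreamingResponse as one possible wiring (an assumption, not a requirement of the pattern); SSE event framing of each chunk is omitted for brevity.
# Illustrative wiring only; assumes the stream_with_buffer generator above.
from uuid import uuid4
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    prompt: str

@app.post("/chat")
async def chat(request: ChatRequest):
    # Each response gets a unique ID so it can be referenced in audit logs
    return StreamingResponse(
        stream_with_buffer(request.prompt, request_id=str(uuid4())),
        media_type="text/event-stream",
    )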
Post-Hoc Retraction — Client-Side¶
The client must support retraction. This means:
- Response IDs — Every streamed response has a unique ID
- Retraction channel — WebSocket or SSE channel for retraction messages
- UI handling — Client replaces or flags retracted content
- Audit logging — Both the original content and retraction are logged
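As a sketch, a retraction message on the server side might look like the following; the field names, the connection registry, and the audit_log helper are assumptions, not a standard schema:
# Illustrative retraction payload and sender; names are assumptions.
import json
from datetime import datetime, timezone

def build_retraction(response_id: str, reason: str) -> str:
    return json.dumps({
        "type": "retraction",
        "response_id": response_id,  # ties back to the streamed response
        "reason": reason,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

async def send_retraction(response_id: str, reason: str):
    message = build_retraction(response_id, reason)
    websocket = active_connections[response_id]   # connection registry is an assumption
    await websocket.send_text(message)            # client replaces or flags the content
    # Log the retraction alongside the original content
    audit_log.record("retraction", response_id, reason)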
What You Lose with Streaming¶
| Capability | Impact on Streaming |
|---|---|
| Full-response guardrails | Degraded — chunk-level only (Pattern 1) or post-hoc (Pattern 2) |
| Judge evaluation | Async only — cannot block delivery in real-time |
| Consistent quality scoring | Scores apply to complete response, available only after stream ends |
| Deterministic safety | You cannot guarantee no user ever sees problematic content in a stream |
This is an inherent trade-off. If you need deterministic safety, don't stream.
AI Runtime Behaviour Security, 2026 (Jonathan Gill).