# Reasoning Model Controls
Addendum for Emerging Controls — practical guidance for reasoning models.
This document uses the simplified three-tier system (Tier 1/2/3). See Risk Tiers — Simplified Tier Mapping for the mapping to LOW/MEDIUM/HIGH/CRITICAL.
## The Problem
Reasoning models (OpenAI o1/o3, Claude with extended thinking, Gemini with thinking mode) generate an internal chain of thought before producing a response. The reasoning trace may:
- Contain content that contradicts system instructions
- Reveal information the final output was designed to withhold
- Show the model "considering" harmful approaches before selecting a safe one
- Be manipulated to bypass safety controls that only check the final output
## Current State of Access
| Provider | Reasoning Trace Access | Implication |
|---|---|---|
| OpenAI (o1/o3) | Not exposed via API | Cannot monitor reasoning — output-only evaluation |
| Anthropic (extended thinking) | Exposed as thinking blocks | Can monitor, but content is explicitly marked as draft reasoning |
| Google (Gemini thinking) | Partially exposed | Limited inspection capability |
| Open-source (Qwen QwQ, etc.) | Fully visible | Full monitoring possible |
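Which controls apply depends on being able to separate the reasoning trace from the final answer. A minimal sketch for an Anthropic-style response, where extended thinking arrives as `thinking` content blocks alongside ordinary `text` blocks (shape based on the publicly documented block types; verify against the current API reference, and note the example content here is invented):

```python
def extract_thinking(content_blocks):
    """Split an Anthropic-style content list into reasoning text and answer text."""
    thinking = [b["thinking"] for b in content_blocks if b.get("type") == "thinking"]
    answer = [b["text"] for b in content_blocks if b.get("type") == "text"]
    return "\n".join(thinking), "\n".join(answer)

# Abbreviated response shape (illustrative, not a real API call):
response_content = [
    {"type": "thinking", "thinking": "The user asks about X; policy allows a summary."},
    {"type": "text", "text": "Here is a safe summary of X."},
]

trace, output = extract_thinking(response_content)
```

With the trace and output separated, each can be routed to its own guardrail pipeline.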
## Controls
### When Reasoning Traces Are Not Accessible
You're evaluating a black box. Your controls are output-focused:
| Control | Implementation |
|---|---|
| Strengthen output evaluation | Judge criteria should include consistency checks: "Does this response contradict the system prompt?" |
| Behavioral testing | Periodically test with adversarial inputs designed to exploit reasoning-level bypasses |
| Comparative evaluation | Compare reasoning model outputs against non-reasoning model outputs on the same inputs; flag divergences |
| Provider assurance | Evaluate provider's claims about reasoning trace safety filtering |
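The comparative-evaluation control above can be sketched as follows. The divergence metric here is a crude lexical ratio from Python's `difflib`; the function names and the threshold are illustrative assumptions, and a production system would substitute an embedding distance or an LLM judge:

```python
import difflib

DIVERGENCE_THRESHOLD = 0.5  # assumed value; tune on your own traffic

def divergence(reasoning_output: str, baseline_output: str) -> float:
    """Crude lexical divergence: 1 minus the similarity ratio of the two outputs."""
    sim = difflib.SequenceMatcher(None, reasoning_output, baseline_output).ratio()
    return 1.0 - sim

def flag_if_divergent(reasoning_output: str, baseline_output: str) -> dict:
    """Compare a reasoning model's output against a non-reasoning baseline."""
    score = divergence(reasoning_output, baseline_output)
    return {"divergence": score, "flagged": score > DIVERGENCE_THRESHOLD}

result = flag_if_divergent(
    "Refund approved per policy section 4.",
    "Refund approved per policy section 4.",
)
```

Flagged pairs go to human review; the interesting cases are where the reasoning model departs from the baseline on the same input.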
### When Reasoning Traces Are Accessible
Additional controls become possible:
| Control | Implementation |
|---|---|
| Trace scanning | Apply guardrails to reasoning traces, not just final outputs |
| Instruction adherence check | Verify the reasoning trace doesn't contain evidence of attempting to bypass system instructions |
| Sensitive data in trace | Check reasoning traces for PII or confidential data that shouldn't appear even in internal reasoning |
| Reasoning-output consistency | Flag cases where the reasoning leads to one conclusion but the output states another |
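Trace scanning and the sensitive-data check can share one pass over the trace. A minimal sketch, assuming regex PII rules and a hand-written list of bypass phrases (both are placeholders; a real deployment would use a proper PII detector and a guardrail model):

```python
import re

# Assumed patterns — extend with your own PII and guardrail rules.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
BYPASS_MARKERS = ["ignore the system prompt", "bypass the instructions"]

def scan_trace(trace: str) -> dict:
    """Scan a reasoning trace for PII hits and instruction-bypass markers."""
    pii_hits = {name: pat.findall(trace) for name, pat in PII_PATTERNS.items()}
    bypass = [m for m in BYPASS_MARKERS if m in trace.lower()]
    return {
        "pii": {k: v for k, v in pii_hits.items() if v},
        "bypass_markers": bypass,
        "clean": not any(pii_hits.values()) and not bypass,
    }

report = scan_trace("User's email is jane@example.com; I should not reveal it.")
```

Note the example deliberately shows the limitation discussed below: the trace contains PII even though the model decides not to reveal it, so a hit is a review signal, not proof of a leak.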
## Practical Limitations
Be honest about what you can't do:
- Reasoning traces are not ground truth. The model may "think" one thing and output another — this is by design (safety training teaches models to reason past harmful thoughts).
- Monitoring traces at scale is expensive. Reasoning traces can be 10–100x longer than the final output.
- Most reasoning traces are benign. The signal-to-noise ratio for trace monitoring is very low.
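Given the cost and low signal-to-noise ratio above, one way to keep trace monitoring bounded is deterministic sampling: hash the request id into a bucket so a fixed fraction of traces is always inspected and the in/out decision is reproducible across reruns. The 5% rate is an assumed placeholder:

```python
import hashlib

SAMPLE_RATE = 0.05  # assumed 5% — set per risk tier

def in_sample(request_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic sampling: hash the request id into [0, 1) and compare to rate."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

# Only the sampled subset of traces goes through the expensive scanning pipeline.
sampled = [rid for rid in (f"req-{i}" for i in range(1000)) if in_sample(rid)]
```

Hash-based sampling beats random sampling here because a flagged request can be re-fetched and re-scanned later and will still be in sample.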
## Recommendation by Risk Tier
| Risk Tier | Reasoning Model Approach |
|---|---|
| Tier 1 | Use reasoning models freely; output-only evaluation sufficient |
| Tier 2 | Output-focused evaluation with periodic behavioral testing |
| Tier 3 | If traces are accessible: sample-based trace monitoring. If not: enhanced output evaluation + adversarial testing. Consider whether a reasoning model is necessary — a non-reasoning model with stronger output controls may be more governable |
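The tier recommendations can be encoded as a small policy table that request routing or CI checks consult. The keys and values here are an illustrative schema, not a standard:

```python
# Assumed encoding of the tier table above — adjust keys to your own pipeline.
TIER_POLICY = {
    1: {"trace_monitoring": "none", "output_eval": "standard",
        "behavioral_testing": "none"},
    2: {"trace_monitoring": "none", "output_eval": "standard",
        "behavioral_testing": "periodic"},
    3: {"trace_monitoring": "sampled", "output_eval": "enhanced",
        "behavioral_testing": "adversarial"},
}

def policy_for(tier: int) -> dict:
    """Look up the monitoring policy for a risk tier (raises KeyError if unknown)."""
    return TIER_POLICY[tier]
```

Keeping the policy in data rather than scattered conditionals makes tier changes auditable in review.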
AI Runtime Behaviour Security, 2026 (Jonathan Gill).