# Behavioral Anomaly Detection — Operational Guide
From metrics collection to actionable alerts.
## The Gap
The framework describes Behavioral Anomaly Detection as a concept: aggregate signals to detect drift from normal. The Operational Metrics extension describes what to measure.
This document bridges the gap: how to turn collected metrics into anomaly detection that feeds your SOC.
## Step 1: Establish Baselines
You can't detect anomalies without a definition of normal.
### Baseline Period
| System Maturity | Baseline Period | Rationale |
|---|---|---|
| New deployment (<1 month) | 2 weeks minimum | Need enough data for statistical significance |
| Established (1–6 months) | Rolling 30-day window | Captures normal variation |
| Mature (>6 months) | Rolling 30-day, seasonally adjusted | Accounts for weekly/monthly patterns |
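For mature systems, "seasonally adjusted" usually means comparing like with like: Mondays against Mondays rather than against the whole week. A minimal sketch, assuming one value per day for the metric is already available (the function name and data shape are illustrative, not part of the framework):

```python
from collections import defaultdict
from datetime import date
from statistics import mean, stdev

def weekday_adjusted_baseline(daily_values: dict[date, float]) -> dict[int, tuple[float, float]]:
    """Group a rolling window of daily values by weekday (0=Mon .. 6=Sun)
    and return (mean, stddev) per weekday, so that a Monday observation
    is scored against other Mondays rather than against the whole week."""
    by_weekday: dict[int, list[float]] = defaultdict(list)
    for day, value in daily_values.items():
        by_weekday[day.weekday()].append(value)
    return {
        wd: (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
        for wd, vals in by_weekday.items()
    }
```

A new observation for a given weekday is then z-scored against that weekday's mean and standard deviation.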
### Baseline Metrics
| Metric | What "Normal" Looks Like | Measurement |
|---|---|---|
| Requests per user per hour | Distribution of usage frequency | Mean, standard deviation, P95 |
| Token volume per request | Typical input/output length | Mean, P50, P95, P99 |
| Guardrail block rate | Percentage of requests blocked | Daily rate, 7-day rolling average |
| Judge flag rate | Percentage of responses flagged | Daily rate, 7-day rolling average |
| Response latency | Time to first token, time to completion | P50, P95, P99 |
| Topic distribution | What subjects users ask about | Top-N topic clusters, relative frequency |
| Error rate | Failed requests, timeouts, retries | Daily rate |
| Unique users per day | Active user count | Daily count, 7-day rolling average |
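A minimal sketch of computing the baseline statistics in the table above from raw observations, using only the Python standard library (the function name is illustrative):

```python
import statistics

def baseline_stats(samples: list[float]) -> dict[str, float]:
    """Summary statistics for one metric over the baseline window.
    `samples` holds the raw observations (per request or per day)."""
    # quantiles(n=100) returns the 99 cut points P1..P99.
    qs = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "mean": statistics.mean(samples),
        "stddev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "p50": qs[49],
        "p95": qs[94],
        "p99": qs[98],
    }
```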
### Statistical Approach
For most metrics, a simple z-score against the rolling baseline is sufficient:
`z = (observed_value - baseline_mean) / baseline_stddev`
| z-score | Interpretation | Action |
|---|---|---|
| < 2.0 | Normal variation | No action |
| 2.0–3.0 | Unusual | Log, include in weekly review |
| 3.0–4.0 | Anomalous | Alert AI team, investigate |
| > 4.0 | Highly anomalous | Alert SOC, immediate investigation |
For rate metrics (block rate, flag rate): use statistical process control (SPC) charts; the upper control limit (UCL) is mean + 3σ of the baseline rate, and a daily rate above it is treated as out of control.
For count metrics (requests per user): use Poisson or negative binomial models, which handle low counts and overdispersion better than a normal approximation.
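A minimal sketch of the z-score check, the action tiers from the table above, and the SPC upper control limit (function names are illustrative):

```python
import statistics

def z_score(observed: float, baseline: list[float]) -> float:
    """z-score of a new observation against the rolling baseline."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return (observed - mu) / sigma if sigma else 0.0

def action(z: float) -> str:
    """Map |z| onto the action tiers from the table above."""
    z = abs(z)
    if z < 2.0:
        return "none"
    if z < 3.0:
        return "log_for_weekly_review"
    if z < 4.0:
        return "alert_ai_team"
    return "alert_soc"

def upper_control_limit(rates: list[float]) -> float:
    """SPC upper control limit (mean + 3 sigma) for a rate metric
    such as the daily guardrail block rate."""
    return statistics.mean(rates) + 3 * statistics.stdev(rates)
```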
## Step 2: Define Detection Rules
### User-Level Anomalies
| Rule | Detection | Likely Cause |
|---|---|---|
| Request volume spike (single user) | Requests > P99 baseline for that user | Automation, data extraction, or testing |
| Token volume spike (single user) | Output tokens > P99 | Attempting to extract large amounts of data |
| Guardrail block spike (single user) | >5 blocks in 1 hour | Probing for bypass, adversarial testing |
| New user, high volume | First-day usage > P95 of established users | Legitimate power user or compromised account |
| Topic shift | User's topic cluster changes significantly | Role change (legitimate) or account takeover |
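As an illustration of the guardrail-block-spike rule above, a minimal sliding-window detector, assuming each guardrail block arrives as an event with a user ID and timestamp (class and field names are assumptions, not a prescribed schema):

```python
from collections import deque
from datetime import datetime, timedelta

BLOCK_THRESHOLD = 5          # blocks per user, per the rule above
WINDOW = timedelta(hours=1)  # sliding detection window

class BlockSpikeDetector:
    """Flags a user once they exceed BLOCK_THRESHOLD guardrail blocks
    within a sliding one-hour window."""

    def __init__(self) -> None:
        self._blocks: dict[str, deque] = {}

    def record_block(self, user_id: str, ts: datetime) -> bool:
        """Record one guardrail block; return True if the user should be flagged."""
        window = self._blocks.setdefault(user_id, deque())
        window.append(ts)
        # Evict events that have fallen outside the one-hour window.
        while window and ts - window[0] > WINDOW:
            window.popleft()
        return len(window) > BLOCK_THRESHOLD
```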
### System-Level Anomalies
| Rule | Detection | Likely Cause |
|---|---|---|
| Global block rate increase | Block rate > UCL (3σ) | New attack pattern, model update, or guardrail misconfiguration |
| Global flag rate increase | Flag rate > UCL (3σ) | Model degradation, provider update, or emerging misuse pattern |
| Latency increase | P95 latency > 2x baseline | Provider issues, resource exhaustion, or DDoS |
| Error rate spike | Error rate > UCL (3σ) | Provider outage, config change, or infrastructure issue |
| Output distribution shift | Cosine similarity of topic distribution vs baseline < 0.8 | Model update, prompt injection campaign, or data drift |
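A minimal sketch of the output-distribution-shift check, assuming the current and baseline topic distributions are relative frequencies over the same top-N topic clusters (function names are illustrative):

```python
import math

def cosine_similarity(current: dict[str, float], baseline: dict[str, float]) -> float:
    """Cosine similarity between today's topic distribution and the baseline,
    both keyed by topic-cluster label with relative frequencies as values."""
    topics = set(current) | set(baseline)
    a = [current.get(t, 0.0) for t in topics]
    b = [baseline.get(t, 0.0) for t in topics]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def distribution_shift(current: dict[str, float], baseline: dict[str, float],
                       threshold: float = 0.8) -> bool:
    """True when similarity to the baseline drops below the 0.8 threshold above."""
    return cosine_similarity(current, baseline) < threshold
```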
### Model-Level Anomalies (Judge Assurance)
| Rule | Detection | Likely Cause |
|---|---|---|
| Judge agreement rate drop | Agreement < baseline - 2σ | Judge model update, generator model update, or calibration drift |
| Judge false positive spike | FP rate > UCL | Judge prompt degradation or input distribution shift |
| Judge latency increase | P95 > 2x baseline | Provider issues affecting judge model |
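A minimal sketch of the judge-agreement-drop rule, assuming a daily agreement rate between the judge and a reference (for example, human spot-checks) is already collected (names are illustrative):

```python
import statistics

def judge_agreement_drop(todays_rate: float, baseline_rates: list[float]) -> bool:
    """True when today's judge agreement rate falls more than two standard
    deviations below the baseline mean (the rule in the table above)."""
    mu = statistics.mean(baseline_rates)
    sigma = statistics.stdev(baseline_rates)
    return todays_rate < mu - 2 * sigma
```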
## Step 3: Alert and Respond
### Alert Routing
| Anomaly Type | Severity | Route To |
|---|---|---|
| User-level, single metric | Low | AI platform team (weekly review) |
| User-level, multiple metrics | Medium | SOC L1 for triage |
| User-level, high volume + blocks | High | SOC L2 + AI security |
| System-level, single metric | Medium | AI platform team (same-day review) |
| System-level, multiple metrics | High | SOC L2 + AI platform team |
| Model-level (judge drift) | Medium | AI security team |
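One way to make the routing table executable is a small lookup keyed by anomaly scope and pattern; the destinations below simply mirror the table, and the queue names are placeholders for your own ticketing or paging targets:

```python
# (severity, destination) per anomaly scope and pattern, mirroring the table above.
ROUTING = {
    ("user", "single_metric"):      ("low",    "ai_platform_weekly_review"),
    ("user", "multi_metric"):       ("medium", "soc_l1_triage"),
    ("user", "volume_plus_blocks"): ("high",   "soc_l2_and_ai_security"),
    ("system", "single_metric"):    ("medium", "ai_platform_same_day"),
    ("system", "multi_metric"):     ("high",   "soc_l2_and_ai_platform"),
    ("model", "judge_drift"):       ("medium", "ai_security"),
}

def route(scope: str, pattern: str) -> tuple[str, str]:
    """Look up severity and destination, defaulting to low-severity platform review."""
    return ROUTING.get((scope, pattern), ("low", "ai_platform_weekly_review"))
```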
### False Positive Management
Anomaly detection will generate false positives. Plan for this:
| Strategy | Implementation |
|---|---|
| Tune thresholds gradually | Start at 4σ (fewer alerts), tighten to 3σ as you gain confidence |
| Allowlisting | Known batch jobs, automated testing, and power users get adjusted baselines |
| Correlation | Single-metric anomalies are logged; multi-metric anomalies are alerted |
| Feedback loop | Analysts mark alerts as TP/FP; use this to adjust thresholds quarterly |
| Alert fatigue monitoring | Track time-to-acknowledge; if it degrades, you have too many alerts |
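A minimal sketch of the quarterly feedback loop, assuming analysts label each closed alert as a true or false positive; the precision target and 0.5σ step size below are illustrative starting points, not recommendations from the framework:

```python
def suggest_threshold(current_sigma: float, true_positives: int, false_positives: int,
                      target_precision: float = 0.5) -> float:
    """Quarterly threshold review based on analyst TP/FP labels.
    Below the precision target: loosen by 0.5 sigma (fewer alerts).
    Comfortably above it: tighten by 0.5 sigma, but never below the
    3-sigma floor used for the system-level rules."""
    total = true_positives + false_positives
    if total == 0:
        return current_sigma
    precision = true_positives / total
    if precision < target_precision:
        return current_sigma + 0.5
    if precision > target_precision + 0.2:
        return max(3.0, current_sigma - 0.5)
    return current_sigma
```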
## Step 4: Tooling
### Build vs. Buy
| Approach | When | Tools |
|---|---|---|
| Extend existing SIEM | You have Splunk/Sentinel/Elastic and AI logs are already ingested | SIEM detection rules + custom dashboards |
| Dedicated AI monitoring | High AI deployment density, need specialised detection | Galileo, Arize, Langfuse, WhyLabs |
| Custom pipeline | Unique requirements or cost constraints | Prometheus + Grafana + custom detectors |
### Minimum Viable Implementation
- Week 1: Emit structured AI logs to your existing SIEM (see SOC Integration); a minimal event format is sketched after this list
- Week 2–3: Collect baseline metrics (2-week minimum)
- Week 4: Implement 3–5 highest-value detection rules (start with user-level volume anomalies and system-level block rate)
- Week 5–8: Tune thresholds based on false positive rate
- Ongoing: Add detection rules as new patterns emerge
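For Week 1, the structured log can be as simple as one JSON event per request carrying the fields the baselines above need; the field names below are placeholders to be mapped onto your SIEM's schema:

```python
import json
import sys
from datetime import datetime, timezone

def emit_ai_event(user_id: str, input_tokens: int, output_tokens: int,
                  latency_ms: float, guardrail_blocked: bool,
                  judge_flagged: bool, topic_cluster: str) -> None:
    """Write one structured AI request event as a JSON line to stdout,
    from where a log shipper can forward it to the SIEM."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "guardrail_blocked": guardrail_blocked,
        "judge_flagged": judge_flagged,
        "topic_cluster": topic_cluster,
    }
    sys.stdout.write(json.dumps(event) + "\n")
```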
AI Runtime Behaviour Security, 2026 (Jonathan Gill).