# Behavioral Anomaly Detection — Operational Guide
From metrics collection to actionable alerts.
## The Gap
The framework describes Behavioral Anomaly Detection as a concept: aggregate signals to detect drift from normal. The Operational Metrics extension describes what to measure.
This document bridges the gap: how to turn collected metrics into anomaly detection that feeds your SOC.
## Step 1: Establish Baselines
You can't detect anomalies without a definition of normal.
### Baseline Period
| System Maturity | Baseline Period | Rationale |
|---|---|---|
| New deployment (<1 month) | 2 weeks minimum | Need enough data for statistical significance |
| Established (1–6 months) | Rolling 30-day window | Captures normal variation |
| Mature (>6 months) | Rolling 30-day, seasonally adjusted | Accounts for weekly/monthly patterns |
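For mature systems, "seasonally adjusted" usually means comparing like with like: Mondays against Mondays rather than against the whole week. A minimal sketch, assuming one value per day for the metric is already available (the function name and data shape are illustrative, not part of the framework):

```python
from collections import defaultdict
from datetime import date
from statistics import mean, stdev

def weekday_adjusted_baseline(daily_values: dict[date, float]) -> dict[int, tuple[float, float]]:
    """Group a rolling window of daily values by weekday (0=Mon .. 6=Sun)
    and return (mean, stddev) per weekday, so that a Monday observation
    is scored against other Mondays rather than against the whole week."""
    by_weekday: dict[int, list[float]] = defaultdict(list)
    for day, value in daily_values.items():
        by_weekday[day.weekday()].append(value)
    return {
        wd: (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
        for wd, vals in by_weekday.items()
    }
```

A new observation for a given weekday is then z-scored against that weekday's mean and standard deviation.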
### Baseline Metrics
| Metric | What "Normal" Looks Like | Measurement |
|---|---|---|
| Requests per user per hour | Distribution of usage frequency | Mean, standard deviation, P95 |
| Token volume per request | Typical input/output length | Mean, P50, P95, P99 |
| Guardrail block rate | Percentage of requests blocked | Daily rate, 7-day rolling average |
| Judge flag rate | Percentage of responses flagged | Daily rate, 7-day rolling average |
| Response latency | Time to first token, time to completion | P50, P95, P99 |
| Topic distribution | What subjects users ask about | Top-N topic clusters, relative frequency |
| Error rate | Failed requests, timeouts, retries | Daily rate |
| Unique users per day | Active user count | Daily count, 7-day rolling average |
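A minimal sketch of computing the baseline statistics in the table above from raw observations, using only the Python standard library (the function name is illustrative):

```python
import statistics

def baseline_stats(samples: list[float]) -> dict[str, float]:
    """Summary statistics for one metric over the baseline window.
    `samples` holds the raw observations (per request or per day)."""
    # quantiles(n=100) returns the 99 cut points P1..P99.
    qs = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "mean": statistics.mean(samples),
        "stddev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "p50": qs[49],
        "p95": qs[94],
        "p99": qs[98],
    }
```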
### Statistical Approach
For most metrics, a simple z-score against the rolling baseline is sufficient:
`z = (observed_value - baseline_mean) / baseline_stddev`
| z-score | Interpretation | Action |
|---|---|---|
| < 2.0 | Normal variation | No action |
| 2.0–3.0 | Unusual | Log, include in weekly review |
| 3.0–4.0 | Anomalous | Alert AI team, investigate |
| > 4.0 | Highly anomalous | Alert SOC, immediate investigation |
For rate metrics (block rate, flag rate): use statistical process control (SPC) charts; the upper control limit (UCL) is mean + 3σ of the baseline rate, and a daily rate above it is treated as out of control.
For count metrics (requests per user): use Poisson or negative binomial models, which handle low counts and overdispersion better than a normal approximation.
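A minimal sketch of the z-score check, the action tiers from the table above, and the SPC upper control limit (function names are illustrative):

```python
import statistics

def z_score(observed: float, baseline: list[float]) -> float:
    """z-score of a new observation against the rolling baseline."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return (observed - mu) / sigma if sigma else 0.0

def action(z: float) -> str:
    """Map |z| onto the action tiers from the table above."""
    z = abs(z)
    if z < 2.0:
        return "none"
    if z < 3.0:
        return "log_for_weekly_review"
    if z < 4.0:
        return "alert_ai_team"
    return "alert_soc"

def upper_control_limit(rates: list[float]) -> float:
    """SPC upper control limit (mean + 3 sigma) for a rate metric
    such as the daily guardrail block rate."""
    return statistics.mean(rates) + 3 * statistics.stdev(rates)
```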
## Step 2: Define Detection Rules
### User-Level Anomalies
| Rule | Detection | Likely Cause |
|---|---|---|
| Request volume spike (single user) | Requests > P99 baseline for that user | Automation, data extraction, or testing |
| Token volume spike (single user) | Output tokens > P99 | Attempting to extract large amounts of data |
| Guardrail block spike (single user) | >5 blocks in 1 hour | Probing for bypass, adversarial testing |
| New user, high volume | First-day usage > P95 of established users | Legitimate power user or compromised account |
| Topic shift | User's topic cluster changes significantly | Role change (legitimate) or account takeover |
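As an illustration of the guardrail-block-spike rule above, a minimal sliding-window detector, assuming each guardrail block arrives as an event with a user ID and timestamp (class and field names are assumptions, not a prescribed schema):

```python
from collections import deque
from datetime import datetime, timedelta

BLOCK_THRESHOLD = 5          # blocks per user, per the rule above
WINDOW = timedelta(hours=1)  # sliding detection window

class BlockSpikeDetector:
    """Flags a user once they exceed BLOCK_THRESHOLD guardrail blocks
    within a sliding one-hour window."""

    def __init__(self) -> None:
        self._blocks: dict[str, deque] = {}

    def record_block(self, user_id: str, ts: datetime) -> bool:
        """Record one guardrail block; return True if the user should be flagged."""
        window = self._blocks.setdefault(user_id, deque())
        window.append(ts)
        # Evict events that have fallen outside the one-hour window.
        while window and ts - window[0] > WINDOW:
            window.popleft()
        return len(window) > BLOCK_THRESHOLD
```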
### System-Level Anomalies
| Rule | Detection | Likely Cause |
|---|---|---|
| Global block rate increase | Block rate > UCL (3σ) | New attack pattern, model update, or guardrail misconfiguration |
| Global flag rate increase | Flag rate > UCL (3σ) | Model degradation, provider update, or emerging misuse pattern |
| Latency increase | P95 latency > 2x baseline | Provider issues, resource exhaustion, or DDoS |
| Error rate spike | Error rate > UCL (3σ) | Provider outage, config change, or infrastructure issue |
| Output distribution shift | Cosine similarity of topic distribution vs baseline < 0.8 | Model update, prompt injection campaign, or data drift |
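A minimal sketch of the output-distribution-shift check, assuming the current and baseline topic distributions are relative frequencies over the same top-N topic clusters (function names are illustrative):

```python
import math

def cosine_similarity(current: dict[str, float], baseline: dict[str, float]) -> float:
    """Cosine similarity between today's topic distribution and the baseline,
    both keyed by topic-cluster label with relative frequencies as values."""
    topics = set(current) | set(baseline)
    a = [current.get(t, 0.0) for t in topics]
    b = [baseline.get(t, 0.0) for t in topics]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def distribution_shift(current: dict[str, float], baseline: dict[str, float],
                       threshold: float = 0.8) -> bool:
    """True when similarity to the baseline drops below the 0.8 threshold above."""
    return cosine_similarity(current, baseline) < threshold
```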
### Model-Level Anomalies (Judge Assurance)
| Rule | Detection | Likely Cause |
|---|---|---|
| Judge agreement rate drop | Agreement < baseline - 2σ | Judge model update, generator model update, or calibration drift |
| Judge false positive spike | FP rate > UCL | Judge prompt degradation or input distribution shift |
| Judge latency increase | P95 > 2x baseline | Provider issues affecting judge model |
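A minimal sketch of the judge-agreement-drop rule, assuming a daily agreement rate between the judge and a reference (for example, human spot-checks) is already collected (names are illustrative):

```python
import statistics

def judge_agreement_drop(todays_rate: float, baseline_rates: list[float]) -> bool:
    """True when today's judge agreement rate falls more than two standard
    deviations below the baseline mean (the rule in the table above)."""
    mu = statistics.mean(baseline_rates)
    sigma = statistics.stdev(baseline_rates)
    return todays_rate < mu - 2 * sigma
```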
## Step 3: Alert and Respond
### Alert Routing
| Anomaly Type | Severity | Route To |
|---|---|---|
| User-level, single metric | Low | AI platform team (weekly review) |
| User-level, multiple metrics | Medium | SOC L1 for triage |
| User-level, high volume + blocks | High | SOC L2 + AI security |
| System-level, single metric | Medium | AI platform team (same-day review) |
| System-level, multiple metrics | High | SOC L2 + AI platform team |
| Model-level (judge drift) | Medium | AI security team |
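One way to make the routing table executable is a small lookup keyed by anomaly scope and pattern; the destinations below simply mirror the table, and the queue names are placeholders for your own ticketing or paging targets:

```python
# (severity, destination) per anomaly scope and pattern, mirroring the table above.
ROUTING = {
    ("user", "single_metric"):      ("low",    "ai_platform_weekly_review"),
    ("user", "multi_metric"):       ("medium", "soc_l1_triage"),
    ("user", "volume_plus_blocks"): ("high",   "soc_l2_and_ai_security"),
    ("system", "single_metric"):    ("medium", "ai_platform_same_day"),
    ("system", "multi_metric"):     ("high",   "soc_l2_and_ai_platform"),
    ("model", "judge_drift"):       ("medium", "ai_security"),
}

def route(scope: str, pattern: str) -> tuple[str, str]:
    """Look up severity and destination, defaulting to low-severity platform review."""
    return ROUTING.get((scope, pattern), ("low", "ai_platform_weekly_review"))
```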
### False Positive Management
Anomaly detection will generate false positives. Plan for this:
| Strategy | Implementation |
|---|---|
| Tune thresholds gradually | Start at 4σ (fewer alerts), tighten to 3σ as you gain confidence |
| Allowlisting | Known batch jobs, automated testing, and power users get adjusted baselines |
| Correlation | Single-metric anomalies are logged; multi-metric anomalies are alerted |
| Feedback loop | Analysts mark alerts as TP/FP; use this to adjust thresholds quarterly |
| Alert fatigue monitoring | Track time-to-acknowledge; if it degrades, you have too many alerts |
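A minimal sketch of the quarterly feedback loop, assuming analysts label each closed alert as a true or false positive; the precision target and 0.5σ step size below are illustrative starting points, not recommendations from the framework:

```python
def suggest_threshold(current_sigma: float, true_positives: int, false_positives: int,
                      target_precision: float = 0.5) -> float:
    """Quarterly threshold review based on analyst TP/FP labels.
    Below the precision target: loosen by 0.5 sigma (fewer alerts).
    Comfortably above it: tighten by 0.5 sigma, but never below the
    3-sigma floor used for the system-level rules."""
    total = true_positives + false_positives
    if total == 0:
        return current_sigma
    precision = true_positives / total
    if precision < target_precision:
        return current_sigma + 0.5
    if precision > target_precision + 0.2:
        return max(3.0, current_sigma - 0.5)
    return current_sigma
```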
## Step 4: Tooling
### Build vs. Buy
| Approach | When | Tools |
|---|---|---|
| Extend existing SIEM | You have Splunk/Sentinel/Elastic and AI logs are already ingested | SIEM detection rules + custom dashboards |
| Dedicated AI monitoring | High AI deployment density, need specialised detection | Galileo, Arize, Langfuse, WhyLabs |
| Custom pipeline | Unique requirements or cost constraints | Prometheus + Grafana + custom detectors |
### Minimum Viable Implementation
- Week 1: Emit structured AI logs to your existing SIEM (see SOC Integration); a minimal event format is sketched after this list
- Week 2–3: Collect baseline metrics (2-week minimum)
- Week 4: Implement 3–5 highest-value detection rules (start with user-level volume anomalies and system-level block rate)
- Week 5–8: Tune thresholds based on false positive rate
- Ongoing: Add detection rules as new patterns emerge
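For Week 1, the structured log can be as simple as one JSON event per request carrying the fields the baselines above need; the field names below are placeholders to be mapped onto your SIEM's schema:

```python
import json
import sys
from datetime import datetime, timezone

def emit_ai_event(user_id: str, input_tokens: int, output_tokens: int,
                  latency_ms: float, guardrail_blocked: bool,
                  judge_flagged: bool, topic_cluster: str) -> None:
    """Write one structured AI request event as a JSON line to stdout,
    from where a log shipper can forward it to the SIEM."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "guardrail_blocked": guardrail_blocked,
        "judge_flagged": judge_flagged,
        "topic_cluster": topic_cluster,
    }
    sys.stdout.write(json.dumps(event) + "\n")
```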
AI Runtime Behaviour Security, 2026 (Jonathan Gill).