Behavioral Anomaly Detection¶
Aggregating safety signals to detect when agent behavior drifts from normal.
The Opportunity¶
Every safety layer generates signals:
| Layer | Signal Type | What It Catches |
|---|---|---|
| Guardrails | Block events | Known-bad patterns |
| LLM-as-Judge | Flag events | Policy violations, quality issues |
| Formal Verification | Validation failures | Rule non-compliance |
| Knowledge Graph | Fact mismatches | Factual errors |
| Token Detection | Uncertainty scores | Low-confidence claims |
| Self-Consistency | Disagreement scores | Unstable responses |
| Human Review | Escalation outcomes | Edge cases, false positives |
Individually, each signal catches specific problems. Aggregated, they reveal behavioral patterns invisible to any single layer.
What Aggregation Enables¶
1. Drift Detection¶
Normal behavior establishes a baseline. Deviation indicates something changed:
- Model drift: New model version produces more guardrail blocks
- Attack campaigns: Spike in similar prompt injection attempts
- Topic sensitivity: Certain subjects trigger disproportionate failures
- User anomalies: Single user probing system boundaries
Week 1: 0.3% of requests flagged
Week 2: 0.3% of requests flagged
Week 3: 0.8% of requests flagged ← Something changed
2. Correlated Failure Discovery¶
When multiple layers flag the same request, it's more significant than any single flag:
Request X:
- Guardrail: PASS (no pattern match)
- Judge: FLAG (inappropriate tone)
- Formal Verify: FAIL (policy violation)
- Human Review: BLOCK
Pattern: Formal + Human correlation = guardrails need updating
3. Unknown Attack Vector Identification¶
Adversaries probe for gaps. Aggregated signals reveal probing patterns before they succeed:
User Y (last 24 hours):
- 47 requests with unusual token patterns
- 12 guardrail near-misses (passed but close to threshold)
- 3 judge flags for "boundary testing" language
- Zero blocks
Assessment: Potential adversarial probing — add to watchlist
4. Effectiveness Measurement¶
Which layers catch what? Where are the gaps?
Last 30 days:
- 1,247 total blocks/flags
- 892 caught by guardrails alone (72%)
- 234 caught by judge after guardrail pass (19%)
- 89 caught by formal verification (7%)
- 32 caught only by human review (2%)
Insight: 28% of issues pass guardrails — judge layer is load-bearing
Architecture¶
Signal Collection¶
Every safety layer emits structured events:
{
"timestamp": "2024-01-15T14:32:01Z",
"request_id": "req_abc123",
"session_id": "sess_xyz789",
"user_id": "user_456",
"model_version": "gpt-4-0125",
"layer": "guardrail",
"event_type": "block",
"category": "prompt_injection",
"confidence": 0.94,
"input_hash": "a1b2c3...",
"metadata": {
"pattern_matched": "ignore_previous",
"input_length": 1247
}
}
Aggregation Pipeline¶
ML Anomaly Detection¶
Baseline modeling: - Historical alert rates by category, user segment, time of day - Normal distribution of confidence scores - Typical failure correlation patterns
Anomaly signals: - Volume anomalies: Alert rate exceeds X standard deviations - Pattern anomalies: New failure signature not seen before - Correlation anomalies: Layers that don't usually correlate suddenly do - User anomalies: Individual behavior deviates from cohort - Temporal anomalies: Unusual time-of-day patterns
Example detections:
| Anomaly Type | Signal | Possible Cause |
|---|---|---|
| Volume spike | 3x guardrail blocks in 1 hour | Coordinated attack, viral jailbreak |
| New pattern | Unknown prompt structure triggering judge | Novel attack vector |
| Correlation shift | KG failures now correlate with judge flags | Model hallucinating in new domain |
| User outlier | One user generating 40% of flags | Adversarial probing |
| Temporal | 4am spike in high-risk requests | Bot activity, different timezone attackers |
Implementation Levels¶
Level 1: Centralized Logging¶
- Collect all safety events in one place
- Basic dashboards and queries
- Manual review of patterns
- Effort: Low | Value: Foundation
Level 2: Automated Alerting¶
- Threshold-based alerts (>X blocks/hour)
- Category-specific monitoring
- On-call integration
- Effort: Medium | Value: Reactive detection
Level 3: Statistical Anomaly Detection¶
- Baseline modeling with moving averages
- Z-score based anomaly flagging
- Seasonal adjustment (time of day, day of week)
- Effort: Medium | Value: Proactive detection
Level 4: ML-Based Pattern Discovery¶
- Unsupervised clustering of failure patterns
- User behavior modeling
- Cross-layer correlation analysis
- Emerging attack signature detection
- Effort: High | Value: Unknown-unknown discovery
What to Track¶
Request-Level Metrics¶
- Pass/block/flag rates by layer
- Confidence score distributions
- Latency impact of safety checks
- False positive rates (from human review)
Session-Level Metrics¶
- Flags per session
- Escalation patterns within session
- Session termination reasons
User-Level Metrics¶
- Flag rate compared to cohort
- Category distribution of flags
- Behavioral trajectory over time
Model-Level Metrics¶
- Performance by model version
- Drift between versions
- Category-specific accuracy
System-Level Metrics¶
- Overall safety layer effectiveness
- Coverage gaps (what passes all checks but fails human review)
- Alert-to-incident conversion rate
Integration with Existing Observability¶
This isn't a separate system — it's an extension of standard observability:
| Traditional Observability | AI Behavioral Monitoring |
|---|---|
| Error rates | Flag rates |
| Latency percentiles | Confidence score percentiles |
| Request tracing | Safety decision tracing |
| Anomaly detection on metrics | Anomaly detection on behaviors |
| Alerting on thresholds | Alerting on behavioral drift |
Platforms already doing this: - Datadog LLM Observability - Arize AI - WhyLabs - Galileo - Langfuse - Weights & Biases
The difference is framing: not just "is the model performing well?" but "is the model behaving safely?"
Privacy and Compliance Considerations¶
Aggregating safety signals creates a detailed behavioral record. Consider:
- Data retention: How long to keep alert data?
- PII in alerts: Scrub or hash user identifiers?
- Access control: Who can query behavioral patterns?
- Audit logging: Track who accessed what analysis?
- Cross-user analysis: Legal basis for cohort comparisons?
The same data that enables security enables surveillance. Design constraints upfront.
Connection to Risk Tiers¶
Monitoring depth should match risk:
| Tier | Monitoring Level | Anomaly Detection |
|---|---|---|
| Tier 1 (Minimal) | Basic logging | Threshold alerts |
| Tier 2 (Moderate) | Aggregated dashboards | Statistical baselines |
| Tier 3 (Significant) | Real-time monitoring | ML anomaly detection |
| Tier 4 (Critical) | Full behavioral analysis | Continuous ML + human review |
The Insider Risk Parallel¶
Enterprise security has been solving this exact problem for humans since 2015. User and Entity Behavior Analytics (UEBA) — originally UBA before Gartner added the "E" — monitors users and non-human entities against behavioural baselines, flags deviations, and scores risk across multiple dimensions.
The "E" is the key. UEBA was extended specifically to cover non-human entities: service accounts, bots, IoT devices, automated processes. Agents are the next entity type. The entire analytical framework transfers.
Three insider threat categories → three agent threat categories¶
Insider risk programs classify threats into three types. Each maps to an agent equivalent:
| Insider Type | Human Example | Agent Equivalent |
|---|---|---|
| Negligent insider | Employee accidentally exposes data through carelessness | Agent drifting through accumulated context — not malicious, but degrading from policy through noise, stale memory, or unchecked context growth |
| Compromised insider | Employee whose credentials are stolen by an external attacker | Agent that's been prompt-injected, memory-poisoned, or whose NHI credentials are being used from an unexpected context |
| Malicious insider | Employee deliberately exfiltrating data | Agent with misaligned objectives — a sleeper agent, a training-time backdoor, or an agent that has been deliberately reprogrammed through persistent manipulation |
The detection challenge is the same in all three cases: the entity has legitimate access. The activity looks authorised. The anomaly is behavioural, not structural.
UEBA indicators that apply directly to agents¶
Insider risk programs monitor specific behavioural dimensions. Each translates to agent monitoring:
| UEBA Indicator (Humans) | Agent Equivalent | What It Catches |
|---|---|---|
| Unusual working hours — logins at 3am, weekend activity | Agent activity at unusual times — processing requests when no users or scheduled triggers should be activating it | Compromised agent being used out-of-band; unauthorised automation; external attacker operating in a different timezone |
| Access beyond role — accessing data irrelevant to job function | Agent accessing tools, data sources, or APIs outside its declared scope | Scope creep through prompt injection; memory-driven misrouting; tool-chain exploitation |
| Privilege escalation — requesting elevated permissions | Agent requesting broader tool access, higher-tier APIs, or cross-tenant data | Prompt injection attempting to widen blast radius; delegated authority abuse |
| Data volume anomalies — bulk downloads, unusual transfer volumes | Agent processing or transmitting significantly more data than baseline | Data exfiltration; RAG over-retrieval; runaway loops |
| New communication destinations — emails to unknown external addresses | Agent calling new external APIs, unknown endpoints, or previously unused tools | Exfiltration via tool invocation; supply chain compromise through redirected API calls |
| Peer group deviation — behaving differently from colleagues in the same role | Agent behaving differently from other agents with the same role and configuration | Individual agent compromise when peers remain normal; configuration drift; selective poisoning |
| Session anomalies — unusual duration, frequency, concurrency | Unusual session length, invocation frequency, or concurrent execution patterns | Agent being driven by an attacker with different usage patterns than legitimate users |
| Behavioural change after events — behaviour shift correlating with known events | Agent behaviour shift correlating with model updates, config changes, or memory additions | Regression from model update; memory poisoning taking effect; configuration tampering |
The temporal dimension deserves emphasis. Your example — the agent starts working at weekends when it did not before — is a classic UEBA indicator. In human insider risk, unusual working hours are one of the strongest early signals of compromised credentials or malicious intent. For agents, the same logic applies: if an agent's activity pattern changes without a corresponding change in its triggering conditions, something has changed about who or what is driving it.
Peer group comparison: the signal UEBA adds that we don't have¶
The most powerful UEBA technique that's missing from the current framework is peer group comparison. In human insider risk, an individual's behaviour is compared not just against their own baseline but against the baseline of their peer group — people in the same role, department, and access tier.
For agents, this means: if you run five customer service agents with the same configuration, and one starts behaving differently from the other four, that's a stronger signal than any individual-baseline deviation. The peer group filters out environmental changes (new data, seasonal patterns, updated prompts) that affect all agents equally, isolating the anomaly to the individual.
Peer comparison catches what individual baselines miss:
- A global shift in query patterns (new product launch, news event) moves all agents' baselines together — not anomalous
- A model update changes all agents' response patterns equally — not anomalous
- One agent's responses diverge while its peers remain stable — anomalous, investigate
Microsoft Sentinel's UEBA builds both individual entity profiles and peer group profiles specifically for this purpose. The same architecture applies to agent fleets.
What this means for the framework¶
The insider risk parallel is not a metaphor. It is a direct technical mapping. Agents are entities with identities, access privileges, behavioural baselines, and the potential for compromise. The 15+ years of UEBA engineering that enterprise security has invested in detecting compromised, negligent, and malicious humans transfers directly to detecting the same patterns in agents.
The practical implication: organisations that already run insider risk programs — with UEBA, SIEM correlation, and behavioural baselines — should extend those programs to cover agent identities. The agent's NHI should be enrolled in the same behavioural analytics pipeline as human user accounts. The same SIEM rules that flag "service account active at unusual hours" should flag "agent active at unusual hours."
This is not a new capability to build. It is an existing capability to extend.
Key Takeaways¶
-
Individual safety layers are necessary but not sufficient — aggregation reveals patterns invisible to any single layer
-
Behavioral baselines enable drift detection — you can't detect anomalies without knowing what's normal
-
Correlated failures are more significant — when multiple independent layers flag the same thing, pay attention
-
ML finds unknown-unknowns — clustering and anomaly detection surface attack patterns you didn't anticipate
-
This is observability, not a new system — extend existing monitoring to include behavioral signals
The question isn't just "did we catch the bad request?" It's "is the agent behaving the way we expect, across all the requests we can see?"
Related¶
- The Verification Gap — Why independent verification matters
- Judge Detects, Not Decides — Async evaluation for pattern analysis
- Current Solutions Reference — Platforms implementing this
- Beyond Security — How the framework's architecture applies to drift, fairness, and other AI risks beyond security
AI Runtime Behaviour Security, 2026 (Jonathan Gill).