2. Why Current Tools Miss It¶
After this module you will be able to¶
- Evaluate your current observability stack against chain-level integrity requirements
- Identify specific gaps in LangSmith, LangFuse, and general-purpose APM tools for multi-agent monitoring
- Distinguish between per-agent observability and chain-level integrity monitoring
- Explain why a Phantom Compliance failure produces clean logs across every layer of a standard stack
Your observability stack today¶
Most engineering teams running AI systems in production have assembled an observability stack that looks something like this:

- Structured logging (JSON logs shipped to a central store)
- Distributed tracing (e.g. Datadog APM)
- Metrics and dashboards (e.g. Prometheus and Grafana)
- Alerting and paging (e.g. PagerDuty)
- LLM-specific metrics (token counts, model versions, cost per request)
This is a solid stack. It will tell you when things are down, slow, or throwing errors. For a Phantom Compliance-style failure, every layer reports healthy. Let's walk through each one.
Layer by layer: what catches what¶
Structured logging¶
Your application logs capture request IDs, agent names, input/output summaries, timestamps, and error codes. Here is what the logs look like for a Phantom Compliance failure:
```json
{"timestamp": "2025-03-15T14:22:01Z", "agent": "retrieval-agent-b",
 "action": "vector_search", "query": "restricted_securities_check",
 "results_returned": 47, "status": "success", "duration_ms": 340}
{"timestamp": "2025-03-15T14:22:02Z", "agent": "compliance-agent-b",
 "action": "compliance_check", "input_docs": 47,
 "result": "CLEAR", "confidence": 0.94, "status": "success",
 "duration_ms": 1200}
{"timestamp": "2025-03-15T14:22:03Z", "agent": "approval-agent-c",
 "action": "trade_approval", "compliance_status": "CLEAR",
 "decision": "APPROVED", "status": "success", "duration_ms": 800}
```
Every log line shows "status": "success". The retrieval returned 47 results, a positive number. The compliance check returned CLEAR with high confidence. The trade was approved.
What's missing: There is no log entry that says "the complete restricted securities list has 312 entries and the agent only saw 47." The retrieval count is logged, but the expected count is not. Without the denominator, the numerator is meaningless.
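For contrast, here is what the same retrieval log line could look like if it carried the denominator. The `expected_results` and `completeness_ratio` fields are hypothetical, not part of any standard schema:

```json
{"timestamp": "2025-03-15T14:22:01Z", "agent": "retrieval-agent-b",
 "action": "vector_search", "query": "restricted_securities_check",
 "results_returned": 47, "expected_results": 312,
 "completeness_ratio": 0.15, "status": "success", "duration_ms": 340}
```

With that one extra field, a simple threshold alert on the ratio would have fired immediately.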
Distributed tracing¶
Your traces show the full call graph:
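As an illustrative reconstruction (agent names and timings taken from the log lines above; the orchestrator span is assumed), the span tree would look something like:

```text
orchestrator ............................ ~2.4s   OK
├─ retrieval-agent-b   vector_search ....  340ms  OK
├─ compliance-agent-b  compliance_check . 1200ms  OK
└─ approval-agent-c    trade_approval ...  800ms  OK
```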
Every span is green. No timeouts, no retries, no errors. The trace is structurally complete and shows a healthy request.
What's missing: The trace tells you that the vector search completed in 340ms and returned successfully. It does not tell you whether the results were complete. Tracing captures the shape of execution (what called what, how long it took) but not the semantic quality of data flowing between components.
LLM-specific metrics¶
Token counts, model versions, prompt templates, and cost per request are all normal. The LLM processed 47 documents and returned a response. Token usage was within expected bounds.
What's missing: Token usage tells you how much data the model processed. It does not tell you whether that data was sufficient for the task. A model that processes 47 documents and a model that processes 312 documents both return valid token counts.
AI-specific observability tools¶
LangSmith¶
LangSmith gives you detailed run traces for LangChain applications, including:
- Full prompt and completion text
- Intermediate chain steps
- Tool call inputs and outputs
- Latency breakdowns by step
- Feedback scores and human annotations
For the Phantom Compliance scenario, LangSmith would show you:
- The exact prompt sent to Agent B (including the 47 retrieved documents)
- Agent B's full response with the CLEAR determination
- Agent C's prompt (including Agent B's CLEAR response) and Agent C's approval
What LangSmith gives you: Full visibility into what each agent saw and said. If you manually inspected Agent B's trace, you could count the documents and notice only 47 were provided.
What LangSmith doesn't give you: An automated alert that 47 is insufficient. LangSmith traces individual runs and lets you inspect them. It does not automatically compare the retrieval count against an expected baseline. It does not track retrieval completeness as a metric. It does not correlate Agent B's data quality with Agent C's decision quality.
LangSmith is an inspection tool, not an integrity monitoring tool. It is excellent for debugging after you know something went wrong. It does not tell you something went wrong in the first place.
LangFuse¶
LangFuse provides similar tracing capabilities with an open-source model, plus:
- Prompt management and versioning
- Evaluation datasets and scoring
- Cost tracking by user, session, or trace
- Integration with evaluation frameworks
For Phantom Compliance, the analysis is the same as for LangSmith. LangFuse records the trace faithfully. A human reviewing the trace could spot the issue. The tool does not surface the issue automatically.
LangFuse's scoring feature is closer to what you need: you can define custom scoring functions that evaluate traces. But the scoring functions you define are typically output-quality scores (is the response relevant, is it well-formatted, does it match a reference). Scoring functions that check input completeness require you to build a custom evaluator that knows what "complete" looks like for each retrieval step. This is possible but not a built-in capability, and most teams don't build it.
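To make "possible but not built-in" concrete, here is a sketch of what such an input-completeness evaluator might look like. This is not LangFuse's actual API; the trace shape and the expected-count registry are assumptions you would have to supply yourself:

```python
# Hypothetical input-completeness scorer for retrieval traces.
# The trace dict shape and EXPECTED_COUNTS registry are assumptions,
# not built-in features of LangFuse or any other tool.

EXPECTED_COUNTS = {
    # query pattern -> number of documents in the source of truth
    "restricted_securities_check": 312,
}

def score_input_completeness(trace: dict) -> float:
    """Return retrieved/expected for a retrieval step; 1.0 means complete."""
    query = trace["retrieval"]["query"]
    retrieved = trace["retrieval"]["results_returned"]
    expected = EXPECTED_COUNTS.get(query)
    if expected is None:
        return 1.0  # no baseline defined for this query; cannot score it
    return retrieved / expected

trace = {"retrieval": {"query": "restricted_securities_check",
                       "results_returned": 47}}
print(round(score_input_completeness(trace), 2))  # 0.15
```

The hard part is not the scoring function; it is maintaining `EXPECTED_COUNTS` per query pattern, which is exactly the work most teams never do.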
Other tools in the space¶
| Tool | What it gives you | What it doesn't give you |
|---|---|---|
| Weights & Biases (Weave) | Trace logging, evaluation tracking, dataset versioning | Chain-level integrity metrics; retrieval completeness alerts |
| Arize Phoenix | Embedding drift, retrieval metrics, LLM evaluations | Cross-agent consistency checks; end-to-end chain integrity |
| OpenLLMetry | OpenTelemetry-based LLM tracing | Semantic quality metrics; only captures structural telemetry |
| Helicone | Request logging, cost tracking, caching analytics | No chain-level view; per-request only |
The gap in every tool: Current AI observability tools give you per-agent or per-request visibility. They capture what happened at each step. They do not automatically evaluate whether what happened at each step was sufficient for the next step in the chain. The unit of analysis is the agent or the request. It needs to be the chain.
The three specific gaps¶
Gap 1: No retrieval completeness baseline¶
To know that 47 results is a problem, you need to know that 312 results is normal. No current observability tool maintains automatic baselines for retrieval result counts by query type. You would need to:
- Define expected result counts (or ranges) for each retrieval query pattern
- Emit the expected count alongside the actual count at retrieval time
- Alert when the ratio drops below a threshold
This is straightforward to build but requires per-query-type configuration. It is not a feature of any off-the-shelf tool today.
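The three steps above can be sketched in a few lines. The threshold and function names here are illustrative, assuming you have already built the per-query-type baseline:

```python
# Sketch of an inline retrieval-completeness check.
# EXPECTED and ALERT_THRESHOLD are assumptions you would configure
# per query pattern; they are not provided by any off-the-shelf tool.

EXPECTED = {"restricted_securities_check": 312}
ALERT_THRESHOLD = 0.95  # alert if fewer than 95% of expected results return

def check_retrieval(query: str, results_returned: int) -> dict:
    # No baseline for this query -> treat the actual count as expected (pass)
    expected = EXPECTED.get(query, results_returned)
    ratio = results_returned / expected if expected else 0.0
    return {"query": query, "actual": results_returned,
            "expected": expected, "ratio": round(ratio, 3),
            "alert": ratio < ALERT_THRESHOLD}

print(check_retrieval("restricted_securities_check", 47))
```

Emitted alongside each retrieval log line, the `alert` field turns the invisible denominator into a pageable event.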
Gap 2: No context utilisation tracking¶
When an agent's context window is 80% full, is that normal? When it is 99% full and the framework silently truncated 1,300 tokens, is that logged?
Current tools track token counts for the LLM call but do not track:
- What percentage of the available context window was used
- Whether any input was truncated before being sent to the model
- Which parts of the input were truncated (system prompt? user context? tool results?)
Context utilisation is the memory pressure metric for LLM-based systems. We don't have the equivalent of a memory usage dashboard for context windows.
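A minimal sketch of such a metric, assuming a 128k-token context window and that your framework can report how many tokens it dropped (neither assumption holds universally):

```python
# Sketch of a context-utilisation metric. The window size is an
# assumption; real values depend on your model and framework.

CONTEXT_WINDOW = 128_000  # tokens available to the model (assumed)

def context_utilisation(prompt_tokens: int, truncated_tokens: int = 0) -> dict:
    """Report window usage and whether any input was silently dropped."""
    sent = min(prompt_tokens, CONTEXT_WINDOW)
    return {
        "utilisation_pct": round(100 * sent / CONTEXT_WINDOW, 1),
        "truncated_tokens": max(truncated_tokens,
                                prompt_tokens - CONTEXT_WINDOW, 0),
        "truncation_occurred": (prompt_tokens > CONTEXT_WINDOW
                                or truncated_tokens > 0),
    }

# A prompt that overflows the window by 1,300 tokens:
print(context_utilisation(prompt_tokens=129_300))
```

Graphed over time, `utilisation_pct` is the context-window equivalent of a memory usage dashboard, and `truncation_occurred` is the out-of-memory alarm.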
Gap 3: No cross-agent consistency scoring¶
The most critical gap: no tool automatically checks whether Agent C's decision is consistent with the quality of Agent B's inputs. This would require:
- Propagating data quality metadata (retrieval completeness, context utilisation, tool response freshness) across agent boundaries
- Evaluating downstream confidence against upstream data quality
- Alerting when an agent produces high-confidence output from low-quality inputs
This is chain-level integrity monitoring. It requires a fundamentally different architecture from per-agent observability, one that tracks the flow of data quality through the chain, not just the flow of data.
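As a sketch of that architecture (the envelope fields, thresholds, and function names are illustrative, not a real framework's API), quality metadata travels with the payload and is checked at each hop:

```python
# Sketch of chain-level integrity metadata propagated across agent
# boundaries. All names and thresholds here are illustrative.

from dataclasses import dataclass

@dataclass
class QualityEnvelope:
    payload: object                 # the data itself, e.g. "CLEAR"
    retrieval_completeness: float   # 0..1, from the upstream retrieval step
    context_utilisation: float      # 0..1, from the upstream LLM call

def check_consistency(upstream: QualityEnvelope,
                      downstream_confidence: float,
                      min_completeness: float = 0.95) -> bool:
    """Return False (alert) when high-confidence output rests on
    low-quality inputs; True when the confidence is justified."""
    if (downstream_confidence >= 0.9
            and upstream.retrieval_completeness < min_completeness):
        return False  # high confidence built on incomplete inputs
    return True

# Agent B's CLEAR verdict, built on 47 of 312 documents:
verdict = QualityEnvelope("CLEAR", retrieval_completeness=47 / 312,
                          context_utilisation=0.8)
print(check_consistency(verdict, downstream_confidence=0.94))  # False
```

The key design choice is that Agent C never receives a bare `"CLEAR"` string; it receives the envelope, so the 0.94 confidence can be judged against the 15% completeness that produced it.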
What clean logs look like for a broken chain¶
To make this concrete, here is a complete observability view of the Phantom Compliance failure as it would appear in a well-instrumented production system using today's best tools:
Grafana dashboard: All green. Request success rate 100%. P99 latency 2.8s (within SLA). Error rate 0%.
Datadog APM: Trace shows three agents executing in sequence. All spans complete. No retries. No errors.
LangSmith: Full trace available for inspection. Agent B received 47 documents, produced CLEAR. Agent C received CLEAR, produced APPROVED.
Prometheus alerts: None firing. Token usage normal. Cost per request normal. Model latency normal.
PagerDuty: Silent.
This is what a multi-agent failure looks like in production. It looks like everything is working.
The real-world consequence¶
Day 0: The failure occurs. Agent B checks 47 of 312 restricted securities. Trade is approved.
Day 1-14: The system continues to operate normally. Throughput metrics look good. Customer satisfaction scores are stable. The engineering team focuses on other priorities.
Day 15: A compliance officer performing a quarterly audit manually cross-references a sample of approved trades against the full restricted securities list. One trade doesn't match.
Day 16: The compliance officer escalates. The engineering team begins investigating.
Day 17: After reviewing LangSmith traces, an engineer notices the retrieval count of 47 and asks "is that all of them?" Nobody knows what the expected count should be.
Day 18: The team queries the vector store directly and discovers 312 matching documents. The failure is confirmed. A full audit of all trades processed in the last 18 days begins.
Total time to detection: 15 days (by a human, doing manual work that had nothing to do with the observability stack).
What you actually need¶
The modules that follow will give you the engineering concepts and frameworks to close these gaps. In preview:
- Module 3 introduces epistemic integrity as the engineering requirement that current tools don't address, and shows you the data structures you need
- Module 4 maps MASO controls to concrete implementation patterns you can build into your agent framework
- Module 5 gives you the specific metrics, dashboards, and tests that would have caught Phantom Compliance on Day 0
The gap is not in your tools' quality. LangSmith, LangFuse, Datadog, and the rest are good at what they do. The gap is in what they're designed to measure. They measure operational health. You need to measure reasoning integrity.
Reflection
Pull up your current AI observability dashboard (or imagine it in detail). For each metric on the dashboard, ask: "Would this metric change if my agent received 15% of the data it needed instead of 100%, but processed that 15% correctly?" If the answer is no, you have found a blind spot.
Consider
Most operational metrics (latency, error rate, throughput, token usage) would not change. The metrics that would change are ones you probably don't have yet: retrieval completeness ratio, context utilisation percentage, cross-agent confidence consistency. Module 5 shows you how to build them.