Datadog ties telemetry to LLMs
- Datadog’s May 7 earnings update put AI observability in focus as the company pitched LLM Observability for tracing prompts, tokens, retrieval steps, and tool calls.
- The product now goes past uptime metrics into quality scoring: Datadog says teams can run managed or custom evaluations on traces and agent workflows.
- That matters because AI apps break semantically, not just technically, and investors rewarded Datadog after it raised Q2 and full-year guidance.

AI observability is becoming its own category, and Datadog is trying to make sure it looks less like a niche add-on and more like the next layer of core monitoring.

The basic pitch is simple. Traditional APM tells you whether a service was up, slow, or throwing errors. But LLM apps fail in weirder ways. They can be fast, technically healthy, and still give the wrong answer, call the wrong tool, leak sensitive data, or burn through tokens. Datadog’s recent push is about making those failures visible inside the same monitoring stack engineers already use.

### What changed this week?

The immediate news was financial. On May 7, Datadog reported first-quarter 2026 results and raised guidance for the second quarter and full year. MarketBeat’s summary of the update shows Q2 adjusted EPS guidance of $0.57 to $0.59 versus a $0.41 consensus estimate, plus revenue guidance around $1.1 billion versus roughly $992 million expected. The stock jumped sharply after that, and Bloomberg said the move was Datadog’s biggest one-day surge in more than six years. (datadoghq.com)

### Why does AI observability need its own tooling?

Because LLM systems are not just another API call. A single user request can fan out into prompt construction, retrieval, multiple model calls, agent decisions, and external tool execution. Datadog’s LLM Observability product is built to trace those steps directly (prompts, model responses, retrieval steps, tool calls, retries, latency, token usage, and errors) so an engineer can see where an agent actually went off the rails. That is a different job from watching CPU, p95 latency, or a generic request trace. (marketbeat.com)

### What’s the missing signal in normal APM?

Basically, semantics and cost. A normal trace can tell you a request completed in 800 milliseconds. It usually cannot tell you whether the model ignored instructions, hallucinated, picked the wrong tool, or spent 5x more tokens than yesterday.
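
To make the cost half of that gap concrete, here is a minimal sketch in plain Python. It is not Datadog's SDK or data model: the `Span` record, the step names, the 1,000 ms latency budget, the 2,000-token baseline, and the 3x threshold are all illustrative assumptions. It shows a request that passes a classic health check (no errors, 800 ms end to end) while spending five times its usual token budget.

```python
from dataclasses import dataclass


# Hypothetical span record for one step of an LLM request.
# Field names are illustrative, not any vendor's schema.
@dataclass
class Span:
    kind: str            # e.g. "retrieval", "llm", "tool"
    name: str
    duration_ms: float
    tokens: int = 0      # tokens used by this step (0 for non-model steps)
    error: bool = False


def summarize(trace: list[Span]) -> dict:
    """Roll a trace up into classic APM signals plus token spend and tools used."""
    return {
        "duration_ms": sum(s.duration_ms for s in trace),
        "errors": sum(1 for s in trace if s.error),
        "tokens": sum(s.tokens for s in trace),
        "tools": [s.name for s in trace if s.kind == "tool"],
    }


def token_spike(summary: dict, baseline_tokens: int, factor: float = 3.0) -> bool:
    """Flag a request whose token spend exceeds `factor` times the baseline."""
    return summary["tokens"] > factor * baseline_tokens


# One made-up request: retrieval, a planning call, a tool call, a final answer.
trace = [
    Span("retrieval", "vector_search", 120.0),
    Span("llm", "plan", 300.0, tokens=4000),
    Span("tool", "search_orders", 80.0),
    Span("llm", "answer", 300.0, tokens=6000),
]

summary = summarize(trace)
# The "normal APM" verdict: no errors and under an assumed 1,000 ms budget.
healthy = summary["errors"] == 0 and summary["duration_ms"] < 1000
# The LLM-aware verdict: compare token spend to an assumed 2,000-token baseline.
spiked = token_spike(summary, baseline_tokens=2000)

print(summary)
print("healthy:", healthy, "| token spike:", spiked)
```

The semantic half (did the model follow instructions, pick the right tool, avoid hallucinating) cannot be read off counters like these, which is where evaluations come in.
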
Datadog has been leaning into that gap by surfacing prompt tracking, token usage, and agent-step visibility as first-class telemetry. Its docs also show OpenTelemetry-based instrumentation paths, which matters because teams do not want to rebuild their whole stack just to watch AI workloads. (datadoghq.com)

### Why are evaluations such a big deal?

Because “the app returned 200 OK” is useless if the answer was bad. Datadog now treats quality as something you can score and monitor, not just eyeball in a playground. Its LLM Observability docs describe managed evaluations, custom LLM-as-a-judge evaluations, and external evaluation ingestion. In plain English, that means teams can attach quality checks to real traces and compare prompts, models, or agent versions with something more rigorous than vibes. (datadoghq.com)

### What does that look like in practice?

Datadog’s own teams are using it internally. One recent post walks through AI Guard, a system that inspects prompts, outputs, and tool calls to block unsafe behavior in Bits AI agents. Another shows the Graphing AI team using LLM Observability and Experiments to debug agent behavior and measure semantic and functional accuracy across model versions. That is the tell here: Datadog is not selling only “monitor your model latency.” It is selling a workflow for testing, evaluating, and operating agents in production. (docs.datadoghq.com)

### Why does Wall Street care?

Because this is how Datadog keeps expanding beyond infrastructure monitoring without leaving its lane. AI workloads create new spend on GPUs, inference, and engineering tools, but they also create a fresh observability problem that looks adjacent to Datadog’s core business. If customers decide AI systems need model-aware traces and quality telemetry inside their existing ops platform, Datadog gets to deepen its role instead of watching point solutions take that budget.
(datadoghq.com) The raised guidance suggests investors think that expansion story is getting more credible.

### So what’s the real takeaway?

The shift is from “is the service healthy?” to “did the agent do the right thing, at the right cost, with the right tools?” Datadog is betting that those questions belong in observability, not in a separate AI sidecar. If that bet holds, LLM telemetry stops being exotic developer plumbing and becomes standard production infrastructure. (datadoghq.com 1) (datadoghq.com 2)