Datadog finds ops lag in AI engineering

- Datadog said on April 21 its 2026 State of AI Engineering report found AI teams are hitting production bottlenecks as agent systems grow harder to run.
- The report says 69% of organizations use three or more models, while about 5% of model requests fail in production, mostly from capacity limits.
- Datadog is pitching observability as the fix as agent frameworks spread and token loads surge. (datadoghq.com)

Datadog said April 21 that AI teams are running into operational problems in production, even as companies ship more agents and use more models. (datadoghq.com)

The company’s 2026 State of AI Engineering report is based on telemetry from more than 1,000 Datadog customers and thousands of organizations running AI in production. Datadog said 69% of organizations now use three or more models. (datadoghq.com 1) (datadoghq.com 2)

About 5% of AI model requests fail in production, Datadog said, and nearly 60% of those failures come from capacity limits rather than model quality. The company said those failures show up as slow responses, errors, and broken user experiences. (datadoghq.com)

An AI agent is software that can make multiple model calls, use tools, and move through several steps before it answers. Datadog defines agents as workloads with multi-step control flow, tool execution, or multiple service calls. (datadoghq.com)

That matters because production AI is starting to resemble distributed cloud software, where routing, retries, capacity planning, and debugging all affect reliability. Datadog said model, prompt, or retrieval changes can shift latency, cost, and failure rates without any obvious code change. (datadoghq.com)

The report says the gap between a good demo and a dependable production system is closed by evaluation and operational discipline. Datadog also said agent framework adoption doubled year over year, adding more moving parts to production systems. (datadoghq.com 1) (datadoghq.com 2)

Datadog said OpenAI still had the largest provider share at 63%, but Google Gemini and Anthropic Claude gained 20 and 23 percentage points over the last year. It also said the number of customers using OpenAI more than doubled even as teams diversified across providers. (datadoghq.com)

The amount of data sent to models is also rising. Datadog said average tokens per request more than doubled for median-use teams and quadrupled for heavy users at the 90th percentile. (datadoghq.com)

Datadog has spent the last two years building products around that operating problem, including LLM Observability for tracing and quality checks, and AI Guard for real-time blocking of unsafe prompts, outputs, and tool calls. Its product posts say the tools combine traces, evaluation signals, and security monitoring around agent workflows. (datadoghq.com 1) (datadoghq.com 2)

Datadog’s own engineering posts make the same case internally. Teams building Bits AI SRE and dashboard agents said they needed replayable evaluations, regression tracking, and trace data to see when one model or prompt change improved one task but made others worse. (datadoghq.com) (datadoghq.com)

Yanbing Li, Datadog’s chief product officer, said AI observability is becoming as essential to the application layer as cloud observability was a decade ago. The report’s numbers point to the same conclusion: companies are no longer just building AI systems, they are operating them. (datadoghq.com)
