Datadog report: operations, not model quality, now hold back AI at scale
- Datadog said on April 21 that its 2026 State of AI Engineering report found operational complexity, not model quality, is now blocking AI at scale.
- The report says 69% of organizations use three or more models, while about 5% of AI requests fail in production and nearly 60% of those failures stem from capacity limits.
- Datadog is framing AI observability as the next control layer for agent systems. (datadoghq.com)
Datadog said April 21 that the main problem in production AI is no longer model quality but operating the systems around it. (datadoghq.com)

The company’s 2026 State of AI Engineering report is based on telemetry from more than 1,000 Datadog customers running AI in production. It says nearly seven in ten organizations now use three or more models. (datadoghq.com 1) (datadoghq.com 2)

That shift means teams are no longer just calling one model from one app. They are juggling OpenAI, Google Gemini, and Anthropic Claude across prompts, tools, retries, and multiple service boundaries. (datadoghq.com 1) (datadoghq.com 2)

An AI agent is software that can take several steps, call tools, and move through a workflow on its own. Datadog separates those agent systems from simpler AI applications that make a single large language model call. (datadoghq.com)

The report says those agent systems are getting harder to run as they spread. Agent framework adoption doubled year over year, adding more moving parts to production systems. (datadoghq.com)

The operational strain shows up in failures and traffic size. Datadog said about 5% of AI model requests fail in production, and nearly 60% of those failures come from capacity limits. (datadoghq.com)

Requests are also getting heavier. The average number of tokens sent to models more than doubled for median-use teams and quadrupled for heavy users, according to Datadog. (datadoghq.com)

Provider share is shifting at the same time. Datadog said OpenAI still led with 63% share, while Google Gemini and Anthropic Claude gained 20 and 23 percentage points, respectively, over the last year. (datadoghq.com)

Datadog’s argument is that AI now resembles early cloud computing: the hard part is not only building features, but controlling cost, latency, routing, failures, and compliance across a distributed stack. Chief Product Officer Yanbing Li said companies that win will build “operational control” around models. (datadoghq.com)

Vercel Chief Executive Guillermo Rauch made the same case in Datadog’s release. He said the next wave of agent failures will come from what teams cannot observe, not only from what agents cannot do. (datadoghq.com)

The report does not support the specific claim that operations lag development by “roughly two years.” What it does show is a market moving toward multi-model, multi-step agent systems, with failures increasingly tied to capacity and system design. (datadoghq.com 1) (datadoghq.com 2)

Datadog’s closing pitch is straightforward: as AI workloads start to look like distributed systems, observability tools want to become the dashboard for the agent era. (datadoghq.com)
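
For a sense of scale, the two failure figures can be combined: if about 5% of requests fail and nearly 60% of those failures trace to capacity limits, then roughly 3% of all requests fail for capacity reasons. The short Python sketch below works through that arithmetic. It uses the rounded figures quoted above, and the function name and the one-million-request volume are hypothetical, chosen only for illustration; the result is an approximate upper bound, not a number from the report.

```python
def capacity_failure_share(failure_rate: float, capacity_fraction: float) -> float:
    """Share of ALL requests that fail because of capacity limits."""
    return failure_rate * capacity_fraction


# Rounded figures as quoted from the report ("about 5%", "nearly 60%").
FAILURE_RATE = 0.05
CAPACITY_FRACTION = 0.60

# Hypothetical request volume, used only to make the share concrete.
requests = 1_000_000

share = capacity_failure_share(FAILURE_RATE, CAPACITY_FRACTION)
capacity_failures = round(requests * share)

print(f"{share:.1%} of requests, about {capacity_failures:,} per {requests:,}")
```

Because both inputs are rounded, the output should be read as an order-of-magnitude estimate.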