Datadog: AI engineering is ops

- Datadog said on April 21 its 2026 State of AI Engineering report found production AI is now constrained more by operations than models.
- The report says 69% of organizations use three or more models, while about 5% of model requests fail in production.
- The shift puts AI work into cloud-style reliability, routing and cost management problems. (datadoghq.com)

Datadog said on April 21 that production AI is now running into operational limits, with reliability and capacity issues overtaking model choice. (investors.datadoghq.com) The company’s 2026 State of AI Engineering report draws on telemetry from more than 1,000 Datadog customers and focuses on organizations already running AI in production. (datadoghq.com)

Datadog found that 69% of organizations now use three or more models, and the share using more than six models nearly doubled over the last year. OpenAI still led with 63% share, while Google Gemini and Anthropic Claude each gained ground. (datadoghq.com) (investors.datadoghq.com)

The report says about 5% of AI model requests fail in production, and nearly 60% of those failures come from capacity limits rather than bad prompts or weak models. (investors.datadoghq.com)

That changes the engineering problem. Teams are no longer shipping one model call inside one app; they are managing model fleets, orchestration frameworks, tool calls, retries, long prompts and multiple service boundaries. (datadoghq.com) Datadog describes the work in familiar cloud terms: routing, life-cycle management, capacity planning, cost control and debugging across distributed systems. The difference is that changing a model, prompt or retrieval step can move latency, spend and failure rates without an obvious code change. (datadoghq.com)

The company also said agent framework adoption doubled year over year, adding more moving parts to production systems as teams push beyond demos into multi-step workflows. (investors.datadoghq.com) Input size is rising too. Datadog said the average number of tokens sent per request more than doubled for median-use teams and quadrupled for heavy users, which raises both latency and spend pressure. (investors.datadoghq.com)

Datadog has been building products around that shift, including LLM Observability for tracing and evaluation, Cloud Cost Management for token- and model-level spending, and GPU Monitoring for infrastructure usage. (datadoghq.com) In a June 2025 product post, Datadog described agent systems as dynamic decision graphs that can branch, retry, hand off work and merge results, making them harder to trace than fixed workflows. (datadoghq.com)

Yanbing Li, Datadog’s chief product officer, said AI now resembles “the early days of cloud,” with the winning companies building operational control around the models they use. (investors.datadoghq.com)
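
To put the report's failure figures in concrete terms, the short Python sketch below works through the arithmetic for a hypothetical workload. It treats "about 5%" and "nearly 60%" as the round numbers 5 and 60, so the results are rough upper-end estimates rather than figures from the report. The function name `estimate_failures` and the one-million-request workload are assumptions made for illustration only.

```python
def estimate_failures(
    total_requests: int,
    failure_pct: int = 5,    # "about 5%" of requests fail (rounded, from the report)
    capacity_pct: int = 60,  # "nearly 60%" of failures are capacity limits (rounded)
) -> dict[str, int]:
    """Split a request volume into failed, capacity-limited and other failures."""
    failed = total_requests * failure_pct // 100
    capacity_limited = failed * capacity_pct // 100
    return {
        "failed": failed,
        "capacity_limited": capacity_limited,
        "other": failed - capacity_limited,
    }


# Hypothetical workload: one million model requests.
estimate = estimate_failures(1_000_000)
print(estimate)
```

Under these assumptions, roughly 30,000 of every million requests, or about 3% of all traffic, would fail because of capacity limits alone, which is why the report frames the problem as one of capacity planning and routing rather than model quality.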
