Datadog launches GPU monitoring
What happened
- Datadog released new GPU monitoring tools to help businesses cut costs and boost AI performance.
- The tooling focuses on visibility into GPU utilization for training and inference workloads.
- Infra teams can pair this telemetry with scheduling and vGPU strategies to better right-size fleets, per ITWire coverage. (itwire.com)
Why it matters
Datadog said on April 22 that its new GPU Monitoring product is now generally available, giving customers a way to track how expensive AI chips are actually being used. (markets.businessinsider.com) Graphics processing units, or GPUs, are the chips that do most of the heavy lifting for training models and serving AI responses. Datadog’s product is built to show utilization, memory, thermals, power, and error data, then tie those signals back to the pods, jobs, models, datasets, and teams using them. (datadoghq.com)

The company said the software is aimed at two common AI jobs: training, which teaches a model from data, and inference, which is the live step where a model answers a prompt or makes a prediction. Datadog said stalled workloads can be traced to the underlying GPUs, pods, and processes so engineers can find bottlenecks in minutes instead of hours. (itwire.com)

Datadog put a number on the cost pressure behind the launch: GPU instances now make up 14% of compute costs for organizations that use them, up from 10% a year earlier in its cloud-cost analysis. In the launch announcement, Chief Product Officer Yanbing Li said that makes budgeting harder when companies cannot see idle devices, workload context, or charge spending back to business units. (datadoghq.com)

The product arrives as companies run more AI systems in production and hit more operational limits. Datadog said on April 21 that 69% of companies using AI in production now use three or more models, and about 5% of AI model requests fail in production, with capacity limits a leading bottleneck. (markets.businessinsider.com)

Datadog’s pitch is that better visibility can reduce overbuying. The company said the product adds usage forecasting and guidance on whether teams should buy more GPUs or free up existing capacity, while also flagging unhealthy devices before failures delay training runs. (markets.businessinsider.com)

The setup is not plug-and-play in every environment. Datadog’s reference architecture says Kubernetes deployments need NVIDIA’s device plugin or GPU Operator installed, and some finer-grained metrics require Linux, a modern kernel, and privileged system-probe access. (datadoghq.com) Datadog has offered NVIDIA GPU metrics before through an integration with NVIDIA Data Center GPU Manager Exporter, but the new release packages fleet health, cost, and workload context into one product the company says is available to all customers. (datadoghq.com)

The immediate test is whether infrastructure teams use that visibility to shrink idle capacity instead of adding more chips. In an AI market where GPU budgets have become a board-level line item, Datadog is selling a dashboard as a cost-control tool. (itwire.com)
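
The choice the article describes, between shrinking idle capacity and buying more chips, comes down to simple arithmetic over utilization readings. The Python sketch below is illustrative only: the `GpuSample` schema, the 10% idle threshold, the hourly rate, the team names, and the sample readings are all assumptions made for this example, and none of it reflects Datadog's product or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSample:
    """One hourly utilization reading for one GPU (hypothetical schema)."""
    gpu_id: str
    team: str
    utilization_pct: float  # 0-100, averaged over the hour


def idle_fraction(samples, idle_threshold_pct=10.0):
    """Share of GPU-hours whose utilization falls below the threshold."""
    if not samples:
        return 0.0
    idle = sum(1 for s in samples if s.utilization_pct < idle_threshold_pct)
    return idle / len(samples)


def chargeback(samples, hourly_rate_usd):
    """Attribute cost per team, treating each sample as one billed GPU-hour."""
    costs = defaultdict(float)
    for s in samples:
        costs[s.team] += hourly_rate_usd
    return dict(costs)


# Made-up readings: two GPUs observed over two hours.
samples = [
    GpuSample("gpu-0", "search", 85.0),
    GpuSample("gpu-0", "search", 90.0),
    GpuSample("gpu-1", "ads", 2.0),
    GpuSample("gpu-1", "ads", 60.0),
]

print(f"idle share: {idle_fraction(samples):.0%}")
print(chargeback(samples, hourly_rate_usd=2.0))
```

In this toy data, one of the four GPU-hours falls under the threshold, so a quarter of the paid capacity sat idle while each team was still billed for its full two hours. That gap between spend and use is what the article says teams struggle to see.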
Key numbers
- Datadog said on April 22 that its new GPU Monitoring product is now generally available, giving customers a way to track how expensive AI chips are actually being used. (itwire.com)
- Datadog put a number on the cost pressure behind the launch: GPU instances now make up 14% of compute costs for organizations that use them, up from 10% a year earlier in its cloud-cost analysis. (itwire.com)
- Datadog said on April 21 that 69% of companies using AI in production now use three or more models, and about 5% of AI model requests fail in production, with capacity limits a leading bottleneck.
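
For scale, the percentages above can be worked through in a few lines of Python. This is a back-of-the-envelope sketch: the one-million-request volume is a hypothetical figure chosen for illustration, and only the 10%, 14%, and 5% values come from the reporting.

```python
def relative_change_pct(old, new):
    """Percentage change from old to new, relative to old."""
    return (new - old) * 100 / old


# Reported GPU share of compute costs, in percent.
gpu_share_prev, gpu_share_now = 10, 14

point_change = gpu_share_now - gpu_share_prev  # percentage points
relative = relative_change_pct(gpu_share_prev, gpu_share_now)

# Reported production failure rate, applied to a hypothetical request volume.
failure_rate_pct = 5
requests = 1_000_000  # assumption for illustration only
expected_failures = requests * failure_rate_pct // 100

print(f"GPU cost share rose {point_change} points ({relative:.0f}% relative growth)")
print(f"About {expected_failures:,} of {requests:,} requests would fail at a {failure_rate_pct}% rate")
```

A four-point rise reads as modest, but relative to the earlier 10% share it means GPU spending grew 40% as a portion of the compute bill.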
What happens next
- The immediate test is whether infrastructure teams use the new visibility to shrink idle capacity instead of adding more chips. (itwire.com)
- Datadog said the product adds usage forecasting and guidance on whether teams should buy more GPUs or free up existing capacity, while also flagging unhealthy devices before failures delay training runs. (markets.businessinsider.com)