Datadog launches GPU monitoring tool
- Datadog said on April 22 that GPU Monitoring is now generally available, adding a product that tracks GPU health, utilization, memory and spend.
- The company says GPU instances now represent about 14% of cloud compute costs, and the tool ties that spend to pods, jobs and teams.
- The launch extends observability deeper into AI infrastructure as GPU demand and spending surge. (datadoghq.com)
Graphics processing units are the chips behind most modern artificial intelligence work, and they are far pricier than ordinary cloud processors. Datadog said on April 22 that its new GPU Monitoring product is now generally available. (datadoghq.com 1) (datadoghq.com 2)

The product is built to show how busy each GPU is, how much memory it is using, whether hardware errors are building up, and what that usage is costing. Datadog says customers can view those signals across cloud, on-premises and “neocloud” environments. (datadoghq.com) (docs.datadoghq.com)

Datadog says the software links device-level data to the work actually running on the chips, including Kubernetes pods, processes and Slurm jobs. That lets platform teams see which model, dataset, job or business team is consuming expensive capacity. (docs.datadoghq.com) (datadoghq.com)

The pitch is about waste as much as uptime. Datadog says GPU Monitoring can flag idle devices, stalled workloads, allocation bottlenecks and failed jobs so teams can reclaim capacity instead of buying more chips. (datadoghq.com) (docs.datadoghq.com)

Datadog framed the launch around rising artificial intelligence infrastructure bills. In its announcement, the company said GPU instances now account for about 14% of cloud compute costs and argued that share will keep growing as more training and inference workloads move into production. (markets.businessinsider.com) (datadoghq.com)

That backdrop is getting bigger fast. International Data Corporation said worldwide spending on artificial intelligence infrastructure reached $89.9 billion in the fourth quarter of 2025, up 62% from a year earlier, with accelerated computing as the backbone of that buildout. (theregister.com) (theoutpost.ai)

Datadog is not starting from zero on chip telemetry.
The company already offered an NVIDIA GPU integration and has been positioning GPU Monitoring alongside Infrastructure Monitoring, application performance monitoring, log management and large language model observability. (datadoghq.com 1) (datadoghq.com 2)

What changed this week is packaging and scope. Datadog moved the product into broad availability and is selling a unified view that combines fleet health, workload performance and cost controls in one place. (datadoghq.com) (markets.businessinsider.com)

The company’s message is that artificial intelligence teams no longer just need to know whether a server is up. They need to know whether a $30,000-class accelerator is busy, broken, misallocated or sitting idle while the bill keeps running. (datadoghq.com) (docs.datadoghq.com)
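
For readers who want a feel for the arithmetic behind that idle-capacity question, here is a minimal Python sketch. It is not Datadog's method or API: the `GpuSample` record, the 5% idle cutoff, the team names and the $4-per-hour price are all assumptions made up for illustration. It simply adds up the cost of hours in which a device sat nearly unused and groups that spend by the team that held the device.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One hour of telemetry for one device (hypothetical record shape)."""
    device_id: str
    team: str            # owner, assumed to come from pod or job labels
    utilization: float   # mean utilization for the hour, 0.0 to 1.0
    hourly_cost: float   # instance price per hour in dollars (assumed)


IDLE_THRESHOLD = 0.05  # assumed cutoff: under 5% busy counts as idle


def idle_spend_by_team(samples, threshold=IDLE_THRESHOLD):
    """Sum the cost of hours in which a device sat below the idle cutoff."""
    totals = defaultdict(float)
    for s in samples:
        if s.utilization < threshold:
            totals[s.team] += s.hourly_cost
    return dict(totals)


# Made-up sample data: five device-hours across two teams.
samples = [
    GpuSample("gpu-0", "search", 0.92, 4.0),
    GpuSample("gpu-0", "search", 0.01, 4.0),
    GpuSample("gpu-1", "ads", 0.00, 4.0),
    GpuSample("gpu-1", "ads", 0.02, 4.0),
    GpuSample("gpu-2", "ads", 0.75, 4.0),
]

idle = idle_spend_by_team(samples)
for team, dollars in sorted(idle.items()):
    print(f"{team}: ${dollars:.2f} spent on idle GPU hours")
```

A real deployment would pull utilization from device telemetry and ownership from scheduler metadata rather than a hard-coded list, and the right idle cutoff would depend on the workload.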