Locked billions in idle GPUs
- Cast AI’s April 21 report and a May 5 Data Center Knowledge write-up put a number on a weird AI reality: enterprises are buying GPUs they barely use.
- The headline figure is brutal: average GPU utilization across non-optimized Kubernetes clusters was 5%, with CPU at 8% and memory at 23%.
- That matters because cloud and data-center builders are still racing to add AI capacity, even while a lot of already-rented compute sits stranded.
GPU scarcity is real. But so is GPU waste. That’s the awkward point behind a new Cast AI report that landed in late April and got fresh attention on May 5: enterprises are acting like accelerators are impossibly precious, then running their Kubernetes clusters at laughably low utilization. The result is not a tiny efficiency miss. It’s a giant pool of paid-for compute sitting idle while everyone talks about shortages.

### What actually got measured?

Cast AI looked at tens of thousands of non-optimized Kubernetes clusters running across AWS, Azure, and Google Cloud. This was not a survey about intentions or budgets. It was operational data about how much of the provisioned infrastructure was actually being used. The averages were stark: GPU utilization at 5%, CPU at 8%, and memory at 23%.

### Why is 5% such a big deal?

Because GPUs are the expensive part. If a normal server sits underused, that hurts. If an H100-class box or rented GPU node sits mostly idle, the burn rate gets ugly fast. At roughly 5% utilization, every GPU-hour of real work carries about 20 paid GPU-hours behind it (1 / 0.05 = 20). That average means companies are often reserving accelerator capacity for possible work rather than keeping it busy with actual work. Basically, they are paying scarcity prices for standby inventory.

### Why would anyone do that?

FOMO, mostly. Teams worry that if they give up GPU capacity now, they won’t get it back when a model training run, fine-tuning job, or internal AI launch suddenly needs it. So they over-request, hold onto nodes, and build buffers inside clusters. Turns out the same shortage psychology that pushes procurement also discourages sharing. Idle capacity becomes a kind of insurance policy.

### Why does Kubernetes make this worse?

Kubernetes is good at orchestrating containers, but it does not magically fix bad resource requests. If teams ask for whole GPUs, oversized CPU reservations, or memory they rarely touch, the scheduler has to honor that. Then you get fragmentation: leftover slivers of capacity scattered across nodes that are too small for real jobs, and clusters that look full on paper while staying mostly empty in practice.
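The fragmentation point is easier to see with numbers. This is a toy model, not the Kubernetes scheduler: it assumes hypothetical nodes with 8 GPUs each and four pods that each request 2 whole GPUs, then compares a spread-style placement (each pod goes to the emptiest node) with a pack-style best-fit placement (each pod goes to the fullest node that still fits). The names `place` and `largest_schedulable_job` are made up for illustration.

```python
from typing import List

GPUS_PER_NODE = 8  # hypothetical node size, chosen for illustration only


def place(requests: List[int], num_nodes: int, strategy: str) -> List[int]:
    """Place whole-GPU requests onto identical nodes and return free GPUs per node.

    strategy="spread": each pod goes to the node with the most free GPUs.
    strategy="pack":   larger requests first; each pod goes to the node with the
                       fewest free GPUs that can still fit it (best fit).
    """
    free = [GPUS_PER_NODE] * num_nodes
    order = sorted(requests, reverse=True) if strategy == "pack" else list(requests)
    for req in order:
        # Only nodes with enough free GPUs are eligible; a pod cannot span nodes.
        candidates = [i for i, f in enumerate(free) if f >= req]
        if not candidates:
            raise RuntimeError(f"cannot place a request for {req} GPUs")
        if strategy == "spread":
            node = max(candidates, key=lambda i: free[i])
        elif strategy == "pack":
            node = min(candidates, key=lambda i: free[i])
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        free[node] -= req
    return free


def largest_schedulable_job(free: List[int]) -> int:
    # The biggest single-node job that still fits is bounded by the emptiest node.
    return max(free)


if __name__ == "__main__":
    pods = [2, 2, 2, 2]  # hypothetical: four pods, each asking for 2 whole GPUs
    for strategy in ("spread", "pack"):
        free = place(pods, num_nodes=4, strategy=strategy)
        print(strategy, free, "total free:", sum(free),
              "largest job:", largest_schedulable_job(free))
```

Running it prints `spread [6, 6, 6, 6] total free: 24 largest job: 6` and `pack [0, 8, 8, 8] total free: 24 largest job: 8`. Both placements leave the same 24 GPUs free, but after spreading no single node can take an 8-GPU job, while packing keeps three nodes completely empty. Kubernetes exposes a similar trade-off through scheduler scoring strategies (for example `LeastAllocated` versus `MostAllocated` in the `NodeResourcesFit` plugin), though real placement also weighs CPU, memory, affinity, and other constraints.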
### Isn’t the market supposed to solve this?

Not quickly. The weird thing is that low enterprise utilization can coexist with genuine capacity pressure at the hyperscaler level. Microsoft, for example, is still dealing with AI demand that outpaces the physical rollout of power, cooling, and data-center capacity. So both things can be true at once: frontier customers can still struggle to get capacity, while many enterprises leave the capacity they already hold poorly scheduled.

### So what unlocks the stranded compute?

The boring answers are the important ones: bin-packing, rightsizing, autoscaling tied to real utilization, preemption for lower-priority jobs, and quotas that stop one team from camping on scarce GPUs. Some platforms also push spot instances and smarter scheduling so short jobs can fill the holes. None of this is glamorous. But it is how you turn “allocated” into “used.”

### Why does this matter beyond cloud bills?

Because every idle GPU distorts the bigger AI buildout story. Companies keep ordering hardware, cloud providers keep expanding capacity, and utilities keep planning for heavier data-center loads. If a meaningful chunk of demand is really waste dressed up as demand, then the industry is dealing with two problems at once: a real shortage at the frontier and a self-inflicted utilization mess in the enterprise.

### Bottom line?

The headline is not “there is no GPU shortage.” The headline is that a lot of enterprises are treating GPUs like rare collectibles instead of productive machines. Until they schedule them better, billions in compute will stay locked, and the AI infrastructure race will look tighter than it really is.