Kubernetes manages mixed GPU clusters
- Kubernetes teams are converging on a playbook for mixed GPU fleets: automate drivers with NVIDIA GPU Operator, match jobs with Dynamic Resource Allocation, and add fast autoscaling.
- The concrete gains are large: Canonical described tuning 10,000 A100 GPUs across 20 on-prem clusters, while NIO reported 10× CI utilization gains and 30% fewer GPU hours.
- The shift tracks Kubernetes making GPUs more programmable through DRA and open tooling, not just fixed whole-device scheduling. (ubuntu.com)
A GPU cluster is a parking lot full of different trucks. Kubernetes is learning to route each job to the truck that fits, instead of handing out keys to the biggest one. (youtube.com) (ubuntu.com)

That matters because many Kubernetes setups still treat a graphics processor as an all-or-nothing device. NVIDIA’s GPU Operator automates the drivers, runtimes, and monitoring needed to make those machines usable inside a cluster. (ubuntu.com)

The newer piece is Dynamic Resource Allocation, or DRA, a Kubernetes feature that lets a workload ask for a class of GPU instead of one exact card. Google engineers said a deployment can request “20GB+” and land on whatever compatible hardware is actually available. (youtube.com)

That changes mixed fleets from a scheduling headache into a capacity pool. Older and newer GPUs can serve the same application when the software only needs a floor on memory or performance, not one specific model. (youtube.com) (cncf.io)

Autoscaling is the second half of the pattern. Karpenter watches for pods that cannot be placed, launches the right nodes directly, and can choose among instance types and purchase options instead of waiting on fixed node groups. (cncf.io)

For teams running private GPU clouds, tenancy is the other problem. vCluster’s AI platform pitches virtual clusters, private nodes, and Karpenter-based auto nodes so one physical GPU estate can be split into isolated environments for different users or projects. (vcluster.com)

The payoff is mostly about utilization, not raw speed. PREP EDU said its heterogeneous RTX 4070 and RTX 4090 inference cluster had static allocation that left utilization at 10% to 20% before it moved to finer-grained GPU orchestration. (cncf.io)

NIO described the same economics at larger scale. Its hybrid environment spans about 600 GPUs across roughly 80 nodes, and the company said a mixed strategy cut simulation GPU hours by 30% and lifted CI pipeline utilization by 10×. (cncf.io)

Canonical’s KubeCon session put an even bigger number on the operational challenge: 10,000 A100 GPUs spread across 20 on-premises Kubernetes clusters. The company’s talk focused on sharing, scheduling, and multi-cluster controls needed to keep that hardware from sitting idle. (youtube.com)

The standards are moving too. NVIDIA said at KubeCon Europe in March 2026 that it would donate its GPU DRA driver to the Cloud Native Computing Foundation, pushing GPU allocation further into Kubernetes’ common plumbing. (ubuntu.com)

So the practical lesson is simple: mixed GPU clusters work best when Kubernetes stops thinking in whole cards and starts thinking in fit, isolation, and idle time. The software stack is becoming the traffic cop for some of the most expensive machines in the data center. (ubuntu.com) (cncf.io)
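
To make the "fit" idea concrete, here is a toy Python sketch of matching a job to a GPU by a memory floor, in the spirit of the "20GB+" request described above. It is an illustration only, not the Kubernetes scheduler or the DRA API: the `Gpu` class, the `pick_gpu` function, the device names, and the best-fit rule (choose the smallest free card that clears the floor) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Gpu:
    """Hypothetical record for one device in a mixed fleet."""
    name: str
    memory_gb: int
    in_use: bool = False


def pick_gpu(fleet: List[Gpu], min_memory_gb: int) -> Optional[Gpu]:
    """Return the smallest free GPU with at least min_memory_gb, or None.

    Best-fit is an assumed policy for this sketch: it keeps larger cards
    free for jobs that genuinely need them.
    """
    candidates = [
        gpu for gpu in fleet
        if not gpu.in_use and gpu.memory_gb >= min_memory_gb
    ]
    return min(candidates, key=lambda gpu: gpu.memory_gb, default=None)


# Hypothetical mixed fleet: device names and sizes are made up.
fleet = [
    Gpu("gpu-a", 80),
    Gpu("gpu-b", 24),
    Gpu("gpu-c", 16),
    Gpu("gpu-d", 24, in_use=True),
]

choice = pick_gpu(fleet, min_memory_gb=20)
print(choice.name if choice else "no fit")
```

In this made-up fleet, a request with a 20 GB floor lands on the free 24 GB card and leaves the 80 GB card available, which is the utilization argument in miniature. A real cluster would express the same constraint declaratively and let the scheduler and driver resolve it.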