Rack-scale and fractional GPU allocation
NIO used HAMi to fractionally allocate 600 GPUs for autonomous-driving workloads, raising continuous-integration GPU utilization from single digits to roughly 30–50%. NVIDIA also published rack-scale guidance covering NVLink partitioning, topology-aware scheduling, and Slurm/Kubernetes integrations for multi-node AI workloads. (x.com, x.com)
Companies are starting to treat graphics processors less like single machines and more like shared infrastructure spread across a rack. NIO said it used HAMi software to split and schedule 600 graphics processors for autonomous-driving workloads, while NVIDIA published new guidance for running jobs across rack-scale systems. (cncf.io, developer.nvidia.com)
A graphics processor, or GPU, is usually booked whole even when a job needs only part of its memory or compute. HAMi, a Cloud Native Computing Foundation project formerly called k8s-vGPU-scheduler, lets Kubernetes carve up heterogeneous accelerators and place workloads by device topology instead of handing one full chip to one task. (github.com, www.cncf.io)
NIO’s case study says the carmaker runs model training, simulation, continuous integration testing, and online inference on a large internal cloud for autonomous driving. The company said its HAMi deployment spans 600 GPUs across 80 nodes. (cncf.io, project-hami.io)
The company said continuous-integration GPU utilization rose from single digits to about 30% to 50% after it adopted fractional allocation, and simulation workloads used 30% fewer GPU hours. HAMi’s case-study page describes the approach as a hybrid of GPU sharing, Multi-Instance GPU partitioning, and time-slicing. (cncf.io, project-hami.io)
NVIDIA’s guidance addresses a different bottleneck: what happens when dozens of GPUs are tied together inside one rack with high-speed links. Its April 2026 developer post says GB200 NVL72 and GB300 NVL72 systems use NVLink switches, support multi-node NVLink inside the rack, and expose partitions that schedulers need to understand before placing jobs. (developer.nvidia.com)
In plain terms, topology-aware scheduling means keeping a job on GPUs that are physically well connected, instead of scattering it across slower paths.
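To make the idea of topology-aware placement concrete, here is a minimal Python sketch. It is not NVIDIA's or HAMi's scheduler: the inventory format, the domain labels such as "rack-a", and the function name place_job are all hypothetical, and the best-fit rule is just one simple policy assumed for illustration. The sketch only places a job if every GPU it needs sits in the same NVLink domain.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical inventory entry: (gpu_id, nvlink_domain). The domain label
# stands in for whatever partition identifier a real scheduler exposes.
Gpu = Tuple[str, str]


def place_job(free_gpus: List[Gpu], needed: int) -> Optional[List[str]]:
    """Pick `needed` GPUs from a single NVLink domain, or return None."""
    if needed <= 0:
        raise ValueError("needed must be positive")

    # Group the free GPUs by the domain they belong to.
    by_domain: Dict[str, List[str]] = defaultdict(list)
    for gpu_id, domain in free_gpus:
        by_domain[domain].append(gpu_id)

    # Best fit: choose the domain with the fewest free GPUs that still fits,
    # so larger domains stay intact for larger jobs.
    candidates = [d for d, ids in by_domain.items() if len(ids) >= needed]
    if not candidates:
        return None  # refuse to scatter the job across domains
    best = min(candidates, key=lambda d: (len(by_domain[d]), d))
    return sorted(by_domain[best])[:needed]


# Assumed example inventory: four free GPUs in one domain, two in another.
free = [
    ("gpu-0", "rack-a"), ("gpu-1", "rack-a"), ("gpu-2", "rack-a"),
    ("gpu-3", "rack-a"), ("gpu-4", "rack-b"), ("gpu-5", "rack-b"),
]
print(place_job(free, 2))  # fits in rack-b, leaving rack-a whole
print(place_job(free, 4))  # only rack-a has four free GPUs
print(place_job(free, 5))  # no single domain fits, so None
```

Returning None instead of splitting the job across domains is a deliberate simplification. A production scheduler would more likely queue the job or fall back to a slower placement under an explicit policy.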
NVIDIA said its Mission Control software maps cluster identifiers and clique identifiers to NVLink domains and partitions, then feeds that information into Slurm and NVIDIA Run:ai for placement and isolation. (developer.nvidia.com, hpcwire.com)
Those two threads meet in the same operational problem: many artificial-intelligence jobs do not need an entire GPU, while the biggest training runs need several GPUs that sit close together. NVIDIA’s Run:ai documentation separately supports GPU fractions for quota planning, showing that vendors are now exposing both smaller-than-a-GPU and rack-sized allocation models in the same stack. (run-ai-docs.nvidia.com, developer.nvidia.com)
The scheduling layer is becoming the control point between those extremes. NIO’s case study focuses on reclaiming idle capacity in software-development and simulation pipelines, while NVIDIA’s post focuses on preserving bandwidth and memory-sharing behavior in Blackwell rack systems tied together by NVLink. (cncf.io, developer.nvidia.com)
The result is a more granular market for compute: fractions of a chip for smaller jobs, and topology-defined groups of chips for larger ones. That is where current GPU infrastructure is heading, from Kubernetes clusters running car-software tests to rack-scale systems built for multi-node artificial-intelligence training. (project-hami.io, developer.nvidia.com)
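The fractional side of that picture can be illustrated with a small Python sketch. It is an assumption-laden toy, not HAMi's or Run:ai's actual algorithm: the Gpu class, the assign function, the 40 GiB device size, and the job names are all hypothetical. Each job asks for a slice of memory and a share of compute, and a first-fit rule packs jobs onto a GPU until one of the two budgets runs out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Gpu:
    name: str
    mem_mib: int                 # free device memory, in MiB
    cores_pct: int = 100         # free compute share, in percent
    jobs: List[str] = field(default_factory=list)


def assign(gpus: List[Gpu], job: str, mem_mib: int, cores_pct: int) -> Optional[str]:
    """First-fit: place the job on the first GPU with enough memory and compute."""
    for gpu in gpus:
        if gpu.mem_mib >= mem_mib and gpu.cores_pct >= cores_pct:
            gpu.mem_mib -= mem_mib
            gpu.cores_pct -= cores_pct
            gpu.jobs.append(job)
            return gpu.name
    return None  # no GPU has room; the job would have to wait


# Two hypothetical 40 GiB GPUs and four small jobs (names are made up).
pool = [Gpu("gpu-0", 40960), Gpu("gpu-1", 40960)]
requests = [
    ("ci-1", 8192, 25),
    ("ci-2", 8192, 25),
    ("sim-1", 16384, 50),
    ("ci-3", 8192, 25),
]
placement: Dict[str, Optional[str]] = {
    job: assign(pool, job, mem, cores) for job, mem, cores in requests
}
print(placement)
```

In this example three jobs share gpu-0 and the fourth spills onto gpu-1, whereas whole-GPU booking would have needed four devices. Real systems also have to enforce the memory and compute limits at runtime, which this sketch does not attempt.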