GPU sharing and K8s resizing debate

- Engineers are weighing GPU multi‑tenant options like NVIDIA MPS, MIG and time‑slicing to trade isolation for cluster density.
- Darryl Ruggles outlined MPS tradeoffs, and Kubernetes VPA InPlaceOrRecreate promises dynamic resource resizing without pod restarts.
- Those techniques push ML platform decisions toward runtime flexibility to improve utilization and manage compute cost pressure.

A single Nvidia GPU can now be carved up more ways inside Kubernetes, and the tradeoff is increasingly between stronger isolation and higher utilization. (docs.nvidia.com)

The simplest model is time-slicing: Kubernetes can oversubscribe one physical GPU so multiple pods take turns on it. Nvidia says those pods interleave on the same card, but they do not get memory or fault isolation from one another. (docs.nvidia.com)

A stricter model is Multi-Instance GPU, or MIG, which splits supported Nvidia chips such as the A100 into smaller hardware-backed instances. Nvidia says those instances are separate and secure, with memory and fault isolation at the hardware layer. (docs.nvidia.com)

Another option is Multi-Process Service, or MPS, which Nvidia describes as a binary-compatible CUDA runtime for cooperative multi-process applications. In practice, engineers use it to let several CUDA processes share one GPU more efficiently, but it does not create the hard partitions that MIG does. (docs.nvidia.com)

That menu of choices is getting more attention as teams try to stop treating every machine-learning job as if it needs an entire accelerator. Nvidia’s Kubernetes docs say time-slicing can also be combined with MIG, giving operators a way to share even the smaller GPU slices. (docs.nvidia.com)

The Kubernetes side of the debate has moved in parallel. The project’s Vertical Pod Autoscaler can now use an “InPlaceOrRecreate” mode that first tries to change a pod’s CPU and memory requests and limits without restarting it, and falls back to replacement only when needed. (kubernetes.io)

That capability rests on in-place pod resize, which Kubernetes promoted to beta in v1.33 in May 2025 and documents as stable in v1.35. Kubernetes says the feature lets operators change CPU and memory on a running pod, often without the disruption of deleting and recreating it. (kubernetes.io 1) (kubernetes.io 2)

The result is a more flexible runtime stack: one set of controls decides how many workloads can share an expensive GPU, and another decides how much CPU and memory each workload keeps while it runs. Kubernetes says Vertical Pod Autoscaler is designed to adjust requests up or down based on historical usage, cluster capacity, and events such as out-of-memory conditions. (kubernetes.io)

The operational costs are different for each path. Nvidia’s MIG docs say changing MIG mode or geometry may require draining workloads and, in some cloud setups, rebooting the node; time-slicing avoids hard partitioning but gives up isolation; MPS improves concurrency for cooperative CUDA jobs but depends on application behavior. (docs.nvidia.com 1) (docs.nvidia.com 2) (docs.nvidia.com 3)

The practical question for platform teams is no longer whether to share GPUs at all. It is where to draw the line between density, performance predictability, and blast radius on clusters where accelerator time is still the most expensive resource. (docs.nvidia.com) (kubernetes.io)
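
For readers who want to see what those two sets of controls look like on paper, here is a minimal sketch in Python that builds both configurations as plain dictionaries and prints them as JSON, which YAML parsers generally accept. The field names follow the Nvidia device plugin's time-slicing config and the Vertical Pod Autoscaler resource as commonly documented; treat them as assumptions and check them against the versions you run. The helper function names, the replica count of 4, and the names `trainer` and `trainer-vpa` are hypothetical and appear only for illustration.

```python
import json


def time_slicing_config(replicas: int) -> dict:
    """Sharing config that advertises each physical GPU as `replicas` schedulable GPUs.

    Assumed layout of the Nvidia device plugin time-slicing config.
    Pods scheduled this way get no memory or fault isolation from each other.
    """
    if replicas < 1:
        raise ValueError("replicas must be a positive integer")
    return {
        "version": "v1",
        "sharing": {
            "timeSlicing": {
                "resources": [
                    {"name": "nvidia.com/gpu", "replicas": replicas},
                ]
            }
        },
    }


def vpa_in_place(name: str, deployment: str) -> dict:
    """VerticalPodAutoscaler manifest that tries an in-place resize first.

    `name` and `deployment` are hypothetical placeholders supplied by the caller.
    """
    return {
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "targetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            # Resize CPU and memory in place; fall back to recreating the pod.
            "updatePolicy": {"updateMode": "InPlaceOrRecreate"},
        },
    }


if __name__ == "__main__":
    print(json.dumps(time_slicing_config(4), indent=2))
    print(json.dumps(vpa_in_place("trainer-vpa", "trainer"), indent=2))
```

The first dictionary governs GPU density and the second governs CPU and memory resizing. They are applied through separate mechanisms, which is why the two decisions can be made independently.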

