Advanced Kubernetes Scheduling for GPUs Gains Traction

Kubernetes has become the de facto platform for ML workloads, and attention is shifting to advanced scheduling that maximizes GPU utilization. Teams are using custom resource definitions (CRDs), operators, and topology-aware scheduling to improve efficiency and reliability. Some R&D teams are even experimenting with giving each AI agent its own control plane so that it can negotiate GPU time for bursty workloads.

- Native Kubernetes only allows whole GPU instances to be assigned to pods, which is inefficient for workloads that only need a fraction of a GPU's capacity. Technologies like NVIDIA's Multi-Instance GPU (MIG) and time-slicing address this by partitioning a physical GPU into smaller, isolated instances or by sharing compute time between processes.
- Topology-aware scheduling is critical for multi-GPU training jobs, as the default Kubernetes scheduler is unaware of the physical layout of GPUs. Placing pods on GPUs connected by high-speed interconnects like NVLink, instead of across slower PCIe links, can improve communication performance by up to 40%.
- The NVIDIA GPU Operator for Kubernetes automates the management of drivers, container runtimes, and monitoring. It utilizes Custom Resource Definitions (CRDs), such as `ClusterPolicy`, to manage the lifecycle of all necessary NVIDIA software components.
- For monitoring, the NVIDIA Data Center GPU Manager (DCGM) integrates with tools like Prometheus to export over 100 GPU-specific metrics, including SM utilization, memory bandwidth, and power draw, providing crucial visibility for cost allocation and performance tuning.
- Time-based fair-share scheduling offers a more equitable distribution of GPU resources over time, preventing teams with bursty, large jobs from being perpetually starved by smaller, more frequent jobs. This approach tracks historical usage to ensure all teams receive their proportional share of compute resources.
- While vLLM is recognized for its high-throughput inference serving and integration with the Hugging Face ecosystem, TensorRT-LLM is often chosen for achieving maximum performance on NVIDIA hardware through deep optimizations. Projects like llm-d are emerging to provide Kubernetes-native distributed serving stacks on top of inference engines like vLLM.
- NVIDIA's Multi-Process Service (MPS) allows multiple CUDA applications to run concurrently on the same GPU, which can be beneficial for sharing resources. However, unlike MIG, it does not provide strong memory and fault isolation between the processes.
- For workloads with intermittent or bursty GPU needs, such as interactive development in Jupyter notebooks or many inference tasks, dedicating a full GPU is highly inefficient and leads to significant underutilization. GPU sharing techniques are essential for improving cost-effectiveness in these scenarios.
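
To make the topology-aware scheduling point in the list above concrete, here is a minimal Python sketch of the underlying idea: given the free GPUs on a node, prefer the set with the most NVLink-connected pairs. The four-GPU layout, `NVLINK_PAIRS`, `nvlink_score`, and `pick_gpus` are all hypothetical names invented for illustration. This is not how the Kubernetes scheduler or any NVIDIA component is implemented, and the brute-force search is only reasonable for the handful of GPUs found on a single node.

```python
from itertools import combinations

# Hypothetical 4-GPU node: GPUs 0-1 and 2-3 are joined by NVLink;
# every other pair is assumed to communicate over slower PCIe.
NVLINK_PAIRS = {frozenset(pair) for pair in [(0, 1), (2, 3)]}


def nvlink_score(gpus, nvlink_pairs):
    """Count how many GPU pairs in the candidate set share an NVLink."""
    return sum(
        1 for pair in combinations(gpus, 2) if frozenset(pair) in nvlink_pairs
    )


def pick_gpus(free_gpus, count, nvlink_pairs):
    """Pick `count` free GPUs with the most NVLink-connected pairs.

    Returns a tuple of GPU indices, or None if too few GPUs are free.
    Ties go to the lowest-numbered combination.
    """
    if count > len(free_gpus):
        return None
    candidates = combinations(sorted(free_gpus), count)
    return max(candidates, key=lambda gpus: nvlink_score(gpus, nvlink_pairs))
```

With GPUs 0, 2, and 3 free, `pick_gpus([0, 2, 3], 2, NVLINK_PAIRS)` returns `(2, 3)`, the only free pair that shares an NVLink, rather than a pair that would have to cross PCIe.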
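
The time-based fair-share idea from the list above can also be sketched in a few lines of Python. The sketch assumes each team has an entitled share and a running total of GPU-hours that is periodically decayed, so that old consumption counts for less; the next job goes to the team whose usage is lowest relative to its share. The `Team` class, the decay factor, and the function names are illustrative assumptions, not the API of any particular scheduler.

```python
from dataclasses import dataclass


@dataclass
class Team:
    name: str
    share: float                  # entitled fraction of the cluster (hypothetical weight)
    used_gpu_hours: float = 0.0   # decayed historical usage


def record_usage(team, gpu_hours):
    """Add a finished job's GPU-hours to the team's history."""
    team.used_gpu_hours += gpu_hours


def decay_usage(teams, factor=0.5):
    """Age historical usage so older consumption weighs less over time."""
    for team in teams:
        team.used_gpu_hours *= factor


def pick_next_team(teams, pending):
    """Return the team with pending jobs whose usage is lowest relative to its share.

    `pending` maps team name to the number of queued jobs.
    Returns None when no team has work waiting.
    """
    candidates = [t for t in teams if pending.get(t.name, 0) > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda t: (t.used_gpu_hours / t.share, t.name))
```

Under these assumptions, a team that has consumed little recently is served first even if another team keeps submitting many small jobs, which is the starvation case described in the list. The decay step is what makes the policy time-based: a large burst stops counting against a team once enough time has passed.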
