GPU sharing and inference tactics

- Engineers are advising GPU sharing on single nodes using NVIDIA’s device plugin to run multiple models before cluster scaling.
- Tradeoffs between MPS, MIG and time-slicing were highlighted for isolation, density and latency in multi-tenant inference setups.
- Open-source inference OS designs and smarter routing promise throughput and cost gains for video AI workloads (x.com) (x.com) (x.com).

A graphics processing unit, or GPU, can be split up like a building with shared rooms, locked suites, or timed shifts, and engineers are telling teams to do that before buying more servers. (docs.nvidia.com) NVIDIA’s Kubernetes device plugin exposes GPUs to containers, and its sharing options let operators oversubscribe a single card so multiple workloads can run on one node. The same plugin supports shared access with CUDA time-slicing and CUDA Multi-Process Service, or MPS. (github.com)

Time-slicing is the simplest version: jobs take turns on the whole GPU. NVIDIA’s GPU Operator says time-slicing lets workloads on oversubscribed GPUs “interleave with one another,” but it does not create memory or fault isolation between tenants. (docs.nvidia.com)

Multi-Instance GPU, or MIG, is the locked-suite version. NVIDIA’s user guide says MIG partitions supported GPUs into separate instances with dedicated compute and memory resources, which gives each slice guaranteed performance. (docs.nvidia.com)

MPS sits between those two. NVIDIA says Multi-Process Service is designed to improve utilization by reducing context-switch overhead and allowing work from multiple processes to run concurrently on the GPU. (docs.nvidia.com)

Those tradeoffs map cleanly to multi-tenant inference, where one cluster serves several models or customers at once. Time-slicing can raise density on older or non-MIG hardware, MIG adds stronger isolation, and MPS can improve utilization when workloads are compatible enough to share a device. (docs.nvidia.com 1) (docs.nvidia.com 2) (docs.nvidia.com 3)

The argument for staying on one node longer is partly economic. A single node avoids the networking and coordination overhead that shows up when teams spread inference across more machines too early, while the device plugin and GPU Operator already provide the control plane for packing more work onto each card. (docs.nvidia.com) (github.com)

A second layer of optimization is routing, or deciding which request goes to which worker. Baseten said this month that cache-aware routing with NVIDIA Dynamo often delivers about 2x faster time-to-first-token in production by sending requests to replicas that already hold useful key-value cache state. (baseten.co)

Other serving stacks are pushing the same idea. Anyscale said custom routing in Ray Serve cut latency by 60% for some large language model and mixture-of-experts workloads, and BentoML has described prefix-aware and key-value-cache-aware load balancing as ways to avoid wasted compute in distributed inference. (anyscale.com) (bentoml.com)

That matters for video and multimodal systems because they often mix bursty traffic, large models, and expensive accelerators. The emerging playbook is to pack more models onto each node first, choose sharing mode based on isolation and latency needs, and then use smarter routing so the cluster spends fewer cycles redoing work it already has in memory. (docs.nvidia.com 1) (docs.nvidia.com 2) (baseten.co)
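
To make the sharing-mode tradeoffs concrete, here is a rough Python sketch of how an operator might encode them as a decision rule. It is an illustration only: the `Workload` fields, the `choose_sharing_mode` function, and the fallback to a dedicated GPU are hypothetical choices made for this example, not part of NVIDIA's device plugin or GPU Operator.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of one tenant's inference workload."""
    name: str
    needs_isolation: bool  # tenant must not share memory or faults with others
    cooperative: bool      # processes are compatible enough to run concurrently


def choose_sharing_mode(workload: Workload, gpu_supports_mig: bool) -> str:
    """Map the tradeoffs described in the article to a sharing mode."""
    if workload.needs_isolation:
        # MIG provides dedicated compute and memory per instance. Without MIG,
        # this sketch assumes the safest fallback is an unshared GPU, since
        # time-slicing offers no memory or fault isolation.
        return "mig" if gpu_supports_mig else "dedicated"
    if workload.cooperative:
        # MPS lets work from multiple compatible processes run concurrently.
        return "mps"
    # Time-slicing: workloads take turns on the whole GPU to raise density.
    return "time-slicing"


if __name__ == "__main__":
    batch = Workload("internal-embeddings", needs_isolation=False, cooperative=True)
    print(choose_sharing_mode(batch, gpu_supports_mig=False))
```

A real deployment would also weigh latency targets and memory footprints, which this sketch leaves out.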
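
The cache-aware routing idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how NVIDIA Dynamo, Ray Serve, or BentoML implement it: the `PrefixAwareRouter` class, the `block_size` parameter, and the simple load counter are all hypothetical. It sends each request to the replica that has already seen the longest leading portion of the prompt, and breaks ties by picking the replica that has received fewer requests.

```python
from collections import defaultdict


class PrefixAwareRouter:
    """Toy router: prefer the replica holding the longest cached prompt prefix."""

    def __init__(self, replicas, block_size=4):
        self.replicas = list(replicas)
        self.block_size = block_size    # tokens per cached block (assumed)
        self.cached = defaultdict(set)  # replica -> prefixes it has processed
        # Requests routed so far, used as a crude stand-in for live load.
        self.load = {r: 0 for r in self.replicas}

    def _prefixes(self, tokens):
        # Yield each block-aligned prefix of the prompt as a hashable key.
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            yield tuple(tokens[:end])

    def route(self, tokens):
        prefixes = list(self._prefixes(tokens))

        def score(replica):
            # Count how many leading blocks this replica already holds.
            hits = 0
            for prefix in prefixes:
                if prefix not in self.cached[replica]:
                    break
                hits += 1
            # More cache hits win; ties go to the less-loaded replica.
            return (-hits, self.load[replica])

        best = min(self.replicas, key=score)
        self.cached[best].update(prefixes)  # the replica now holds this state
        self.load[best] += 1
        return best


if __name__ == "__main__":
    router = PrefixAwareRouter(["replica-a", "replica-b"])
    shared_prompt = [1, 2, 3, 4, 5, 6, 7, 8]  # e.g. a common system prompt
    print(router.route(shared_prompt + [9, 10, 11, 12]))
    print(router.route([20, 21, 22, 23]))
    print(router.route(shared_prompt + [30, 31, 32, 33]))
```

In this run, the third request shares its first eight tokens with the first one, so it is routed back to the replica that served the first request even though both replicas have handled the same number of requests. Production routers also have to account for cache eviction and real-time load, which this sketch ignores.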

Shared from Scout - Be the smartest in the room.