Technique for Dynamic GPU Model Swapping
A new method for optimizing GPU inference costs involves dynamically swapping models between disk and GPU memory on demand. This technique can reportedly reduce idle GPU memory by 40-60% in multi-tenant environments where numerous models are served from a fixed hardware pool. The approach introduces cold-start latency whenever a requested model is not already resident, but that latency can be mitigated with predictive loading and warm-up queues managed via orchestrators like Kubernetes.
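To make the swapping idea concrete, here is a minimal sketch of a least-recently-used (LRU) manager that keeps models resident only while they fit in a fixed memory budget. It is an illustration under stated assumptions, not the method any particular vendor uses: `ModelSwapper`, `sizes_gb`, `load_fn`, and `preload` are hypothetical names, and the loader is a stand-in for real code that would read weights from disk into GPU memory.

```python
from collections import OrderedDict


class ModelSwapper:
    """Toy LRU manager for a fixed memory budget (hypothetical sketch)."""

    def __init__(self, budget_gb, sizes_gb, load_fn):
        self.budget_gb = budget_gb
        self.sizes_gb = sizes_gb        # assumed known up front: name -> size in GB
        self.load_fn = load_fn          # hypothetical loader: name -> model object
        self.resident = OrderedDict()   # name -> model, least recently used first
        self.evictions = []             # names evicted so far, in order

    def used_gb(self):
        return sum(self.sizes_gb[name] for name in self.resident)

    def get(self, name):
        """Return the model, loading it (and evicting others) if needed."""
        if name in self.resident:
            # Warm hit: mark as most recently used, no load cost.
            self.resident.move_to_end(name)
            return self.resident[name]

        size = self.sizes_gb[name]
        if size > self.budget_gb:
            raise MemoryError(f"{name} ({size} GB) exceeds the {self.budget_gb} GB budget")

        # Cold miss: evict least recently used models until the new one fits.
        while self.used_gb() + size > self.budget_gb:
            evicted, _ = self.resident.popitem(last=False)
            self.evictions.append(evicted)

        self.resident[name] = self.load_fn(name)  # this is the cold-start cost
        return self.resident[name]

    def preload(self, name):
        """Predictive-loading hook: pay the load cost before a request arrives."""
        self.get(name)
```

A scheduler that expects traffic for a model could call `preload` ahead of time, so the first real request becomes a warm hit. A production system would also need to handle in-flight requests on a model that is about to be evicted, which this sketch ignores.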
- This technique is an evolution of features found in established inference servers like the NVIDIA Triton Inference Server, which uses an "explicit" model control mode to allow models to be loaded and unloaded via API calls without restarting the server.
- A key challenge is the trade-off between memory savings and latency; NVIDIA's implementation of a similar "hot-swapping" feature reports a time-to-first-token (TTFT) of 2-3 seconds, which is a 50-66x improvement over scaling from zero but still higher than a constantly warm model.
- To combat the cold-start problem, specialized tools like the open-source NVIDIA Run:ai Model Streamer can be used to accelerate model loading by concurrently reading weights from storage and streaming them directly into GPU memory.
- This approach of swapping entire models can be complemented by swapping smaller, task-specific LoRA adapters, which allows a single base model to remain on the GPU while lightweight adapters are dynamically loaded for different fine-tuned tasks.
- The concept is similar to memory optimization in vLLM, which uses PagedAttention to manage the KV cache by swapping memory "pages" to CPU RAM, though PagedAttention focuses on managing the dynamic KV cache rather than the static model weights.
- Snowflake has implemented a similar "model hotswapping" technique for its Cortex AI platform, allowing it to serve over 30 different models from a fixed pool of hardware by caching weights in CPU memory and on disk.
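The explicit model control mode mentioned above can be exercised over Triton's HTTP API, which exposes `POST v2/repository/models/<name>/load` and `.../unload` endpoints when the server is started with `--model-control-mode=explicit`. The sketch below builds those requests with the Python standard library only. The helper names (`model_control_request`, `swap_model`), the base URL, and the model names are assumptions for illustration, and the sketch does not poll for readiness after a load.

```python
import urllib.parse
import urllib.request


def model_control_request(base_url, model_name, action):
    """Build (but do not send) a Triton model-repository load/unload request."""
    if action not in ("load", "unload"):
        raise ValueError(f"unsupported action: {action!r}")
    name = urllib.parse.quote(model_name, safe="")
    url = f"{base_url.rstrip('/')}/v2/repository/models/{name}/{action}"
    return urllib.request.Request(url, method="POST")


def _post(req):
    # Sends the request; urlopen raises HTTPError on a non-2xx response.
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.status


def swap_model(base_url, old_model, new_model, send=_post):
    """Unload one model, then load another (hypothetical helper).

    Unloading first frees memory before the new weights arrive, at the cost
    of a window in which neither model is being served.
    """
    send(model_control_request(base_url, old_model, "unload"))
    send(model_control_request(base_url, new_model, "load"))
```

Against a running server this might be called as `swap_model("http://localhost:8000", "model_a", "model_b")`, where the host, port, and model names are placeholders. The `send` parameter exists so the request sequence can be inspected or tested without a live server.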