AWS Enables Multi-Model Serving with vLLM
Amazon Web Services has integrated the vLLM inference engine with its SageMaker and Bedrock platforms. The integration allows organizations to serve dozens of fine-tuned or Mixture-of-Experts (MoE) models on a single endpoint, reducing costs associated with idle GPU time for multi-model deployments.
- The core technology behind vLLM's efficiency is PagedAttention, which manages the memory of the key-value (KV) cache by partitioning it into fixed-size, non-contiguous blocks, similar to how an operating system uses paging for virtual memory. This approach drastically reduces memory fragmentation and waste, allowing for larger batch sizes and higher throughput.
- vLLM employs a continuous batching mechanism, in which the server admits new requests at each iteration step rather than waiting for an entire batch to complete. This keeps the GPU consistently utilized and can improve throughput by 2 to 10 times compared to traditional static batching.
- In benchmark comparisons against other inference solutions such as Ollama, vLLM has demonstrated significantly higher throughput, peaking at 793 tokens per second versus Ollama's 41 on the same hardware in one test. Another study found vLLM could achieve up to 24 times higher throughput than Hugging Face's Text Generation Inference (TGI) under high concurrency.
- The open-source vLLM project originated at UC Berkeley's Sky Computing Lab and is now part of the Linux Foundation, with contributions from entities like Meta, Red Hat, Hugging Face, NVIDIA, and Google. The startup commercializing the technology is reportedly raising over $160 million at a potential valuation of around $1 billion.
- While vLLM is optimized for high throughput on GPUs, competitors like NVIDIA's TensorRT-LLM are designed for maximum performance specifically on NVIDIA hardware by using fused CUDA kernels and deep graph optimizations. The choice between them often involves a trade-off between vLLM's flexibility with the Hugging Face ecosystem and TensorRT-LLM's peak performance within the NVIDIA stack.
- The integration addresses the challenges of serving Mixture-of-Experts (MoE) models, which have a large total parameter count but activate only a subset of "experts" for any given input. While computationally efficient, MoE inference requires high memory capacity to hold all expert parameters, a challenge that vLLM's memory management helps address.
- This move by AWS is part of a broader "build vs. buy" strategy among hyperscalers to control the AI stack and reduce dependency on third-party chipmakers. AWS designs its own custom silicon, such as Inferentia and Trainium chips, to offer lower-cost AI training and inference compared to general-purpose GPUs.
- Serving multiple fine-tuned models typically incurs high memory overhead, as each model requires its own memory space. By using techniques like PagedAttention and quantization, vLLM allows multiple models or LoRA adapters to be served from a single base model, significantly reducing the memory footprint and cost per model.
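
The paging idea behind PagedAttention, described above, can be illustrated with a toy block table. This is a minimal sketch, not vLLM's implementation: the class names, the 16-token block size, and the pool of 8 blocks are all assumptions chosen for illustration. It shows only the bookkeeping, where each sequence maps logical token positions to whichever physical blocks happen to be free, so no sequence needs one large contiguous reservation.

```python
# Toy sketch of paged KV-cache bookkeeping; NOT vLLM's actual code.
BLOCK_SIZE = 16  # tokens per block (assumed value for illustration)


class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV cache pool exhausted")
        return self.free.pop()

    def release(self, block_ids: list[int]) -> None:
        self.free.extend(block_ids)


class Sequence:
    """Maps a sequence's logical token positions to physical blocks."""

    def __init__(self, allocator: BlockAllocator) -> None:
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def slot(self, position: int) -> tuple[int, int]:
        """Return (physical block id, offset within block) for a token position."""
        return self.block_table[position // BLOCK_SIZE], position % BLOCK_SIZE

    def free(self) -> None:
        # Return all blocks to the pool so other sequences can reuse them.
        self.allocator.release(self.block_table)
        self.block_table = []
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8)
a, b = Sequence(allocator), Sequence(allocator)
# Interleave growth so the two sequences end up with non-contiguous blocks.
for _ in range(20):
    a.append_token()
    b.append_token()
print(a.block_table, b.block_table)
```

In this sketch, the unused space per sequence is bounded by one partially filled block, which is the property that lets more sequences share the same pool.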
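
To make the contrast between static and continuous batching concrete, the following toy simulation counts decode steps under each policy. It is a sketch under simplifying assumptions (one token per request per step, no prefill cost, hypothetical request lengths) and does not reflect vLLM's real scheduler; the function names are invented for this example.

```python
# Toy step-count simulation of static vs. continuous batching; NOT vLLM's scheduler.
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Finished requests leave and waiting ones join at every decode step."""
    waiting = deque(lengths)
    running: list[int] = []  # remaining tokens per in-flight request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step generates one token for every running request;
        # requests that reach zero remaining tokens drop out of the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical output lengths (tokens) for eight requests.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)  # 200 105
```

With these made-up lengths, short requests no longer wait behind the long ones, so the same work finishes in roughly half the steps. Actual gains depend on the workload's length distribution and arrival pattern.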
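
The MoE memory point above comes down to simple arithmetic: every expert must be resident in memory, even though only a few are used per token. The sketch below works through that calculation for a hypothetical model. The parameter counts, the top-2 routing, and the 2-bytes-per-parameter (16-bit) weight format are illustrative assumptions, not figures from any specific model or from AWS.

```python
# Back-of-the-envelope MoE weight memory; all model numbers are hypothetical.

def moe_weight_bytes(
    shared_params: int,
    params_per_expert: int,
    num_experts: int,
    active_experts: int,
    bytes_per_param: int = 2,  # assumes 16-bit weights
) -> tuple[int, int]:
    """Return (bytes that must be resident, bytes actually used per token)."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + active_experts * params_per_expert
    return total * bytes_per_param, active * bytes_per_param


resident, used = moe_weight_bytes(
    shared_params=2_000_000_000,      # attention, embeddings, router (assumed)
    params_per_expert=5_000_000_000,  # assumed
    num_experts=8,
    active_experts=2,                 # top-2 routing (assumed)
)
print(f"resident: {resident / 1e9:.0f} GB, used per token: {used / 1e9:.0f} GB")
```

For this made-up configuration, 84 GB of weights must be held in memory while only 24 GB worth participate in any one token's computation, which is why memory capacity rather than compute tends to be the constraint.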