Survey Compares Top LLM Serving Frameworks
A new technical survey benchmarks four leading LLM serving frameworks: vLLM, TensorRT-LLM, DeepSpeed-MII, and Ray Serve. The analysis highlights that vLLM leads in high-throughput batch serving, while NVIDIA's TensorRT-LLM excels at maximizing GPU throughput with advanced quantization. The choice of framework can reportedly improve throughput by up to 10x and cut serving costs by 50% compared to naive deployments.
- vLLM's core innovation, PagedAttention, manages the memory for attention keys and values the way an operating system manages virtual memory. The KV cache is stored in fixed-size blocks that need not be contiguous, which largely eliminates the fragmentation and over-reservation that reportedly waste 60-80% of KV cache memory in earlier systems. The reclaimed memory enables larger batch sizes and higher throughput.
- TensorRT-LLM compiles models into highly optimized "engine" files specific to the target GPU architecture, applying techniques like layer and tensor fusion to merge multiple operations into single CUDA kernels. This ahead-of-time compilation minimizes memory read/write overhead and uses Tensor Cores for lower-precision formats like FP8 and INT4 to push hardware utilization toward its limits.
- DeepSpeed-MII uses a technique called Dynamic SplitFuse, which splits long user prompts (prefill) into smaller chunks and fuses them with the generation of new tokens (decode), so that each forward pass processes a consistent number of tokens. This lets it batch and schedule requests more effectively, reportedly delivering up to 2.3x higher throughput and 2x lower latency than systems like vLLM, especially for workloads with long prompts.
- Ray Serve is designed for more than single-model inference; it excels at building complex, multi-model applications and inference graphs. It lets you compose multiple models and business logic in pure Python and scale each component independently across a cluster, even assigning fractional GPUs to different models to maximize resource utilization.
- LLM inference has two distinct phases: a compute-bound "prefill" stage that processes the input prompt in parallel, and a memory-bandwidth-bound "decode" stage that generates output tokens one at a time. Optimizing a serving framework means managing the trade-offs between these phases, particularly the KV cache, which grows with every decoded token and, alongside the model weights, is a major consumer of GPU memory.
- While API-based services like OpenAI charge per token, self-hosting costs are driven by GPU instance uptime. For smaller models (under 30B parameters), self-hosting can be significantly more cost-effective, provided the GPU is kept busy. A single A100 80GB GPU, which is sufficient for models in the 7B-13B parameter range, can cost around $3-5 per hour.
- Quantization is a key optimization technique that reduces the numerical precision of model weights (e.g., from 16-bit floating point to 8-bit or 4-bit formats). This shrinks the model's memory footprint and can roughly double the speed of operations on hardware with native low-precision support, such as NVIDIA's H100 GPUs. TensorRT-LLM has strong support for various quantization methods, including FP8, which typically maintains high model accuracy.
- Ray Serve is built on the Ray distributed computing framework, allowing it to scale from a local laptop to a large Kubernetes cluster without code changes. This makes it practical to prototype locally and then deploy the same application to production at enterprise scale.
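
To make the KV cache and PagedAttention points above concrete, here is a minimal back-of-the-envelope sketch in Python. It is not vLLM's implementation. The function names are hypothetical, and the model shape (32 layers, 32 KV heads, head dimension 128, FP16 values), the block size of 16 tokens, and the four request lengths are all assumptions chosen for illustration. The sketch estimates KV cache bytes per token, then compares how many token slots are reserved when every request pre-allocates a contiguous maximum-length slot versus when each request holds only the fixed-size blocks it has actually touched.

```python
import math


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache added per token (factor 2 = one key + one value)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def contiguous_tokens_reserved(seq_lens, max_len):
    """Naive scheme: every request reserves a contiguous max_len-token slot."""
    return len(seq_lens) * max_len


def paged_tokens_reserved(seq_lens, block_size=16):
    """Block-based scheme: each request holds only the blocks it has touched.

    Waste is limited to the unused tail of each request's last block.
    """
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


# Assumed model shape (roughly a 7B-class dense model) and toy workload.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
seq_lens = [100, 500, 1200, 40]  # tokens currently held by each request
max_len = 2048                   # assumed maximum context length

used = sum(seq_lens)
contig = contiguous_tokens_reserved(seq_lens, max_len)
paged = paged_tokens_reserved(seq_lens)

mib = 1024 ** 2
print(f"KV cache per token: {per_token / mib:.2f} MiB")
print(f"contiguous: {contig * per_token / mib:.0f} MiB reserved, "
      f"{1 - used / contig:.1%} unused")
print(f"paged:      {paged * per_token / mib:.0f} MiB reserved, "
      f"{1 - used / paged:.1%} unused")
```

In this toy workload the contiguous scheme leaves roughly three quarters of its reserved slots unused, while the block-based scheme leaves under 2% unused. Real savings depend on the actual request-length distribution and block size.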