Stripe cuts inference costs 73% with vLLM

Stripe achieved a 73% reduction in its inference costs by deploying vLLM, an open-source serving library. The savings were attributed to vLLM's PagedAttention mechanism and advanced batching techniques, which significantly improve GPU utilization for large language model inference.

- The core innovation, PagedAttention, was inspired by virtual memory and paging concepts from traditional operating systems, applying them to manage the LLM's key-value cache. This method partitions the KV cache into smaller, non-contiguous blocks, which drastically reduces memory waste by allowing for more flexible allocation.
- Developed at UC Berkeley, vLLM has quickly become a standard for production LLM serving, used by companies like Anthropic and Replicate to serve billions of tokens daily. Its open-source community has grown rapidly, with contributors increasing from 190 to 740 and monthly downloads surging by 4.5x in 2024.
- While TensorRT-LLM is highly optimized for NVIDIA GPUs and can achieve higher throughput in specific scenarios with fixed batch sizes, vLLM often shows better performance with variable-length requests and spiky traffic patterns due to its dynamic continuous batching scheduler. Some user benchmarks have even shown vLLM outperforming TensorRT-LLM in both throughput and latency on H100 GPUs.
- Stripe has a history of building its own ML infrastructure to support products like Radar, which blocks fraud. They developed an internal platform called Railyard on Kubernetes to train thousands of models daily, emphasizing a generic API not tied to a single ML framework.
- The vLLM 2025 roadmap includes ambitious goals like enabling GPT-4o level performance on a single GPU and introducing disaggregated serving, which separates the compute-bound prompt processing from the memory-bound token generation onto different, specialized hardware.
- Cost optimization techniques like quantization are a key focus for vLLM, with over 20% of its deployments already using methods like FP8, GPTQ, and AWQ. Future plans involve enhanced support for new formats and deeper integration with PyTorch's compilation tools to further improve performance.
- For enterprise use cases, vLLM's roadmap includes features tailored for reinforcement learning from human feedback (RLHF), such as custom checkpoint loaders and multi-turn scheduling to avoid preemption in long-running agentic tasks.
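
The paging idea behind PagedAttention can be illustrated with a toy allocator. This is a minimal sketch, not vLLM's implementation: the names `BlockAllocator`, `Sequence`, `append_token`, and the block size of 16 tokens are assumptions made for illustration. It shows how each sequence keeps a block table that maps its tokens onto fixed-size blocks, which need not be adjacent in memory, and how that bounds wasted cache slots to less than one block per sequence.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block; illustrative value, not a vLLM default claim


class BlockAllocator:
    """Hypothetical pool of fixed-size KV-cache blocks shared by all sequences."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_ids: list[int]) -> None:
        # Blocks from a finished sequence go straight back to the pool.
        self.free.extend(block_ids)


@dataclass
class Sequence:
    num_tokens: int = 0
    # Logical block index -> physical block id, like a page table.
    block_table: list[int] = field(default_factory=list)


def append_token(seq: Sequence, allocator: BlockAllocator) -> None:
    """Reserve a new block only when the current one is full."""
    if seq.num_tokens % BLOCK_SIZE == 0:
        seq.block_table.append(allocator.allocate())
    seq.num_tokens += 1


def wasted_slots_paged(seq: Sequence) -> int:
    """Unused slots in the last, partially filled block."""
    return len(seq.block_table) * BLOCK_SIZE - seq.num_tokens


def wasted_slots_contiguous(seq: Sequence, max_len: int) -> int:
    """Unused slots if max_len slots had been reserved up front."""
    return max_len - seq.num_tokens


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=8)
    a, b = Sequence(), Sequence()
    # Interleave two requests so their blocks end up non-adjacent.
    for _ in range(16):
        append_token(a, allocator)
    for _ in range(5):
        append_token(b, allocator)
    for _ in range(4):
        append_token(a, allocator)
    print("a block table:", a.block_table)
    print("a waste (paged):", wasted_slots_paged(a))
    print("a waste (contiguous, max_len=2048):", wasted_slots_contiguous(a, 2048))
```

In this toy setup, a 20-token sequence occupies two blocks and leaves 12 slots unused, whereas reserving a contiguous 2,048-token region up front would leave 2,028 slots idle. The freed capacity is what lets a scheduler fit more concurrent requests on the same GPU.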
