Stripe Cuts LLM Inference Costs by 73% with vLLM

Stripe has reportedly cut its large language model (LLM) inference costs by 73% by adopting vLLM, a high-throughput serving engine. The implementation of features like PagedAttention is credited with delivering throughput gains of 2 to 24 times. A detailed guide outlines their production architecture, covering batching, memory management, and scalable serving for cost-effective LLM infrastructure.
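To put the two headline figures side by side, the arithmetic below is a rough sketch, assuming cost per token is inversely proportional to throughput at a fixed hardware price. That assumption, the function names, and the GPU price and token rate in the example are illustrative and do not come from Stripe's account.

```python
def throughput_multiplier(cost_reduction: float) -> float:
    """Throughput gain that would yield a given fractional cost cut.

    Assumes (hypothetically) that cost per token scales as 1 / throughput,
    i.e. the same hardware bill is spread over more generated tokens.
    """
    if not 0.0 <= cost_reduction < 1.0:
        raise ValueError("cost_reduction must be in [0, 1)")
    return 1.0 / (1.0 - cost_reduction)


def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Serving cost in USD per one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # A 73% cost cut corresponds to roughly a 3.7x throughput gain
    # under the inverse-proportionality assumption above.
    print(round(throughput_multiplier(0.73), 2))
    # Illustrative inputs only: a $2/hour GPU sustaining 1,000 tokens/s.
    print(round(cost_per_million_tokens(2.0, 1000.0), 4))
```

Under that assumption, a 73% saving implies a gain of about 3.7x, which sits inside the reported 2x to 24x range. Real savings also depend on factors this sketch ignores, such as utilization and hardware choice.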

- The core innovation of vLLM, PagedAttention, draws inspiration from virtual memory and paging in traditional operating systems. It partitions the memory-intensive key-value (KV) cache into smaller, non-contiguous blocks, allowing for more flexible and efficient memory allocation, similar to how an OS manages RAM.
- vLLM originated as a research project at UC Berkeley's Sky Computing Lab and was introduced in a 2023 paper titled "Efficient Memory Management for Large Language Model Serving with PagedAttention". Since its open-source release, it has evolved into a production-grade serving system.
- vLLM maintains a low Time To First Token (TTFT) even under high concurrent user loads, which makes it well suited to latency-critical interactive applications. Benchmarks show, however, that its token generation rate can be lower than that of other frameworks such as LMDeploy, especially for quantized models.
- Stripe's work with LLMs goes beyond inference optimization. The company has developed an LLM-powered system to assist human support agents and a separate transformer-based model that analyzes structured financial data for fraud detection. It also offers an "agent toolkit" that lets developers integrate payment functionality into LLM-based agentic workflows.
- The cost of LLM inference at a given quality level has been dropping dramatically, with some analyses showing a 1000x decrease over three years. This trend is driven by more efficient open-source models and optimized serving systems like vLLM.
- In production, LLM inference presents two distinct bottlenecks: the "prefill" phase, which processes the initial prompt and is compute-intensive, and the "decode" phase, which generates subsequent tokens and is memory-bandwidth-intensive. vLLM's PagedAttention primarily targets the memory-bound decode phase.
- Major tech companies are tackling LLM serving challenges with different strategies. Meta is developing advanced parallelism techniques and disaggregating prefill and decode into separate services that scale independently. Netflix is building a unified framework to optimize post-training and serving of models adapted to its own data.
- The vLLM open-source project supports a wide range of popular LLM architectures from Hugging Face, including Mixture-of-Experts (MoE) models like Mixtral, and offers features like continuous batching, various quantization methods (GPTQ, AWQ), and speculative decoding.
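The paging idea behind PagedAttention can be illustrated with a toy allocator. The sketch below is not vLLM's implementation: `BlockPool`, `Sequence`, and `BLOCK_SIZE` are hypothetical names, and it tracks only block IDs rather than real key/value tensors. It shows a per-sequence block table mapping logical positions to physical blocks that are allocated on demand and need not be contiguous.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


class OutOfBlocks(RuntimeError):
    """Raised when the pool has no free physical blocks left."""


@dataclass
class BlockPool:
    """Fixed pool of physical KV-cache blocks, identified by integer ID."""
    num_blocks: int
    free: list[int] = field(init=False)

    def __post_init__(self) -> None:
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise OutOfBlocks("no free KV blocks")
        return self.free.pop()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)


@dataclass
class Sequence:
    """One request's KV cache, addressed through a block table."""
    pool: BlockPool
    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)  # logical -> physical

    def append_token(self) -> None:
        # Grab a new physical block only when the previous one is full,
        # so at most one partially filled block is held per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

    def slot(self, token_index: int) -> tuple[int, int]:
        # Translate a logical token position into (physical block, offset),
        # much as a page table translates a virtual address.
        if not 0 <= token_index < self.num_tokens:
            raise IndexError(token_index)
        return self.block_table[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE

    def finish(self) -> None:
        # Return every block to the pool once the request completes.
        self.pool.release(self.block_table)
        self.block_table = []
        self.num_tokens = 0


if __name__ == "__main__":
    pool = BlockPool(num_blocks=8)
    a, b = Sequence(pool), Sequence(pool)
    for _ in range(20):
        a.append_token()
    for _ in range(5):
        b.append_token()
    for _ in range(13):
        a.append_token()
    print(a.block_table, b.block_table, len(pool.free))
```

In the demo, the first sequence grows to 33 tokens and holds three blocks, with the second sequence's block allocated in between, so its cache is physically non-contiguous yet still addressable through `slot`. Because memory is reserved one block at a time rather than for the maximum possible sequence length, more concurrent requests fit in the same pool.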
