Stripe Cuts LLM Inference Costs by 73% with vLLM

Stripe reported a 73% reduction in inference serving costs after deploying vLLM, a high-throughput serving framework for large language models. The company achieved throughput gains of 2-24x with an architecture built on PagedAttention, autoscaling, and batching. The result is a useful reference point for cost efficiency and reliability in production AI workloads.

- The core innovation of vLLM is PagedAttention, an algorithm inspired by virtual memory and paging in operating systems. It divides the large memory stores used for context (KV caches) into smaller, fixed-size blocks, which significantly reduces memory waste caused by fragmentation and over-allocation.
- Before solutions like vLLM, memory waste in the KV cache could be as high as 60-80%, as systems had to pre-allocate memory for the maximum possible output length of a request. PagedAttention avoids this by allocating memory on demand, leading to near-optimal memory utilization with an average waste of only 4%.
- vLLM was originally developed at the Sky Computing Lab at UC Berkeley and is now a community-driven open-source project. This allows for transparent development, customization, and faster bug fixes.
- In addition to PagedAttention, vLLM employs continuous batching, which processes incoming requests dynamically instead of waiting for a fixed-size batch to fill up. This keeps the GPU constantly utilized, minimizing idle time and maximizing throughput.
- While Stripe's 73% cost reduction announcement focused on inference, the company has also developed its own foundation model trained on tens of billions of financial transactions to improve fraud detection. This model's application boosted detection rates for attacks on large users from 59% to 97% overnight.
- Stripe also utilizes LLMs to assist its internal support operations, providing AI-generated response suggestions to help agents solve customer issues more efficiently across a complex product suite. The system is designed as an agent-assistance tool, not a customer-facing chatbot.
- The vLLM framework is designed for flexibility, supporting a wide range of popular open-source models from Hugging Face, including Transformer-based models like Llama, Mixture-of-Experts models like Mixtral, and multi-modal models. It also provides an OpenAI-compatible API server, making it easier for developers to integrate.
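
To make the memory argument concrete, here is a minimal sketch comparing the two allocation strategies described above: reserving KV-cache space for the maximum output length versus allocating fixed-size blocks on demand. The block size, the 2048-token cap, and the request lengths are illustrative assumptions, not figures from Stripe or measurements of vLLM, and the helper names are hypothetical.

```python
import math

BLOCK_SIZE = 16  # tokens per KV-cache block (assumed value for illustration)
MAX_LEN = 2048   # assumed maximum sequence length per request


def preallocated_slots(max_len: int) -> int:
    """Token slots reserved when a request is sized for the maximum length."""
    return max_len


def paged_slots(actual_len: int, block_size: int = BLOCK_SIZE) -> int:
    """Token slots reserved when fixed-size blocks are allocated on demand."""
    return math.ceil(actual_len / block_size) * block_size


def waste_fraction(reserved: int, used: int) -> float:
    """Fraction of reserved slots that never hold a token."""
    return (reserved - used) / reserved


# Hypothetical workload: final sequence lengths of four requests.
actual_lengths = [100, 500, 37, 1200]

used_total = sum(actual_lengths)
prealloc_total = sum(preallocated_slots(MAX_LEN) for _ in actual_lengths)
paged_total = sum(paged_slots(n) for n in actual_lengths)

prealloc_waste = waste_fraction(prealloc_total, used_total)
paged_waste = waste_fraction(paged_total, used_total)

print(f"pre-allocation waste:   {prealloc_waste:.1%}")
print(f"paged allocation waste: {paged_waste:.1%}")
```

With paged allocation, the only unused slots are in the last, partially filled block of each request, which is why waste stays small regardless of how far actual lengths fall below the cap.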
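
The effect of continuous batching can be shown with a toy simulation that counts decode steps under each policy. This is a simplified sketch, not vLLM's scheduler: it assumes every request generates one token per step, ignores prefill and memory limits, and uses hypothetical function names and request lengths.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each fixed batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled from the queue right away."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step produces one token per running request;
        # requests that reach zero remaining tokens leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical workload: output lengths (in tokens) in arrival order.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]

static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(f"static batching:     {static_steps} steps")
print(f"continuous batching: {continuous_steps} steps")
```

For this toy workload the static schedule needs 16 steps and the continuous one needs 10, because short requests no longer hold slots idle while the longest request in their batch finishes.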
