Stripe Cuts Inference Costs 73% With vLLM

Stripe reported a 73% reduction in its machine learning inference costs after deploying vLLM, an open-source, high-throughput serving engine for large language models. vLLM uses PagedAttention to achieve throughput gains of 2x to 24x compared with earlier serving systems. The deployment offers a blueprint for scaling LLM-based features while keeping operating costs under control.

- The core technology, PagedAttention, was developed at UC Berkeley and manages the memory for attention keys and values by partitioning the KV cache into blocks, which allows for non-contiguous storage. This method is inspired by virtual memory and paging concepts used in operating systems.
- Before this optimization, serving systems often wasted 60-80% of the GPU memory allocated to the KV cache due to fragmentation and overallocation. vLLM with PagedAttention reduces this waste to under 4%.
- Stripe has a long history of applying machine learning to its products, including powering its Radar fraud detection service and its Billing platform to retry failed charges. The company's ML infrastructure handles hundreds of millions of predictions daily across numerous models.
- The adoption of vLLM is widespread among major tech companies; it is used in production by Meta, Mistral AI, Cohere, IBM, and powers features for Amazon's Rufus and LinkedIn. Another public success story comes from Roblox, which saw a 50% reduction in latency for its AI systems after implementing vLLM.
- This move addresses the significant operational costs of serving large language models, where expenses for GPU infrastructure, specialized LLMOps engineers (with salaries reaching over $268,000), and model monitoring can lead to substantial financial burdens for companies.
- vLLM is an open-source project with a rapidly growing community, seeing its GitHub contributors expand from 190 to 740 in 2024, with contributions from organizations like UC Berkeley, Anyscale, Roblox, IBM, AMD, Intel, and NVIDIA.
- For engineers building AI applications, vLLM offers an OpenAI-compatible API server, which simplifies integration and allows developers to query a self-hosted model in the same format as the OpenAI API.
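
The block-based KV cache described above can be illustrated with a toy allocator. This is a minimal sketch of the paging idea only, not vLLM's actual code: the class `PagedKVCache`, its methods, and the block size of 4 tokens are all hypothetical names and values chosen for illustration. Each sequence keeps a block table that maps its logical blocks to whichever physical blocks happen to be free, so its cache does not need to be contiguous, and the only wasted memory is the unused tail of each sequence's last block.

```python
from dataclasses import dataclass, field


@dataclass
class PagedKVCache:
    """Toy paged allocator (hypothetical; not vLLM's implementation)."""
    num_blocks: int                                    # physical blocks available
    block_size: int = 4                                # tokens per block (assumed value)
    free_blocks: list = field(default_factory=list)    # ids of unused physical blocks
    block_tables: dict = field(default_factory=dict)   # seq_id -> [physical block ids]
    seq_lens: dict = field(default_factory=dict)       # seq_id -> tokens stored

    def __post_init__(self):
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id):
        """Reserve room for one more token; allocate a block only when needed."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n == len(table) * self.block_size:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_fraction(self):
        """Share of allocated token slots that hold no token."""
        allocated = sum(len(t) for t in self.block_tables.values()) * self.block_size
        used = sum(self.seq_lens.values())
        return 0.0 if allocated == 0 else 1 - used / allocated


# Two sequences grow in lockstep, so their blocks interleave in physical memory.
cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):
    cache.append_token("a")
    cache.append_token("b")

print(cache.block_tables)       # {'a': [7, 5], 'b': [6, 4]}
print(cache.wasted_fraction())  # 0.375
```

The waste in this toy run is high only because the sequences are tiny relative to the block size. With longer sequences, the unused tail of the last block becomes a small share of the total, which is the intuition behind the low waste figure quoted above.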
