Stripe Cuts Inference Costs 73% with vLLM Migration

Stripe achieved a 73% cost reduction in its AI inference serving by migrating its systems to vLLM, an open-source library for large language model inference. The move to a high-throughput architecture using techniques like PagedAttention allowed the company to scale its AI workloads more efficiently. This case study demonstrates how architectural choices in AI deployment can significantly impact operational costs.

- The vLLM library originated at UC Berkeley's Sky Computing Lab and is now a community-driven open-source project. It is designed for high-throughput and memory-efficient LLM inference, supporting models from Hugging Face and hardware including NVIDIA, AMD, and AWS accelerators.
- The core innovation, PagedAttention, is an attention algorithm inspired by virtual memory and paging in operating systems. It partitions the KV cache into blocks that can be stored non-contiguously, reducing memory waste to less than 4%, compared to 60-80% in less efficient systems.
- vLLM replaces traditional static batching with continuous batching, also known as in-flight or dynamic batching. This allows the server to start processing new requests as soon as individual sequences in a batch are completed, maximizing GPU utilization and delivering up to 24x higher throughput than standard Hugging Face Transformers.
- This infrastructure is critical for Stripe's broader AI strategy, which includes a transformer-based Payments Foundation Model trained on billions of transactions. This model treats financial data like a language to understand patterns, which improved the detection rate for certain fraud attacks on large users from 59% to 97%.
- The memory sharing enabled by PagedAttention is particularly effective for complex sampling methods like beam search or parallel sampling, cutting their memory usage by up to 55%. This can directly translate into a throughput increase of up to 2.2x for those specific use cases.
- In the broader landscape of inference servers, vLLM is often compared to alternatives like NVIDIA's TensorRT-LLM and Hugging Face's TGI. While TensorRT-LLM is often chosen for achieving the absolute lowest latency on specific NVIDIA hardware, vLLM is favored for its high concurrency, flexibility across different hardware, and ease of use.
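
The paging idea behind PagedAttention can be illustrated with a toy block allocator. This is a minimal sketch, not vLLM's implementation: the class `PagedKVCache`, the helper `fill_interleaved`, the block size of 16 tokens, and the 2048-token reservation used for comparison are all assumptions chosen for illustration. It shows how a per-sequence block table lets KV-cache blocks sit non-contiguously, so waste is limited to the unused tail of each sequence's last block.

```python
BLOCK_SIZE = 16  # tokens per KV block (assumed value, for illustration only)


class PagedKVCache:
    """Toy allocator: maps each sequence to a list of physical block ids."""

    def __init__(self, num_blocks, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Store one more token, taking a new block only when the last is full."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_fraction(self):
        """Fraction of allocated token slots that hold no token."""
        allocated = sum(len(t) for t in self.block_tables.values()) * self.block_size
        used = sum(self.seq_lens.values())
        return 0.0 if allocated == 0 else 1 - used / allocated


def fill_interleaved(cache, target_lens):
    """Append tokens round-robin, roughly as concurrent decoding would."""
    remaining = dict(target_lens)
    while remaining:
        for seq_id in list(remaining):
            cache.append_token(seq_id)
            remaining[seq_id] -= 1
            if remaining[seq_id] == 0:
                del remaining[seq_id]


def contiguous_wasted_fraction(seq_lens, max_len):
    """Waste when every sequence reserves max_len contiguous slots up front."""
    return 1 - sum(seq_lens) / (max_len * len(seq_lens))


if __name__ == "__main__":
    lens = {"a": 100, "b": 37, "c": 250}
    cache = PagedKVCache(num_blocks=64)
    fill_interleaved(cache, lens)
    print("block table for a:", cache.block_tables["a"])
    print("paged waste:", round(cache.wasted_fraction(), 3))
    print("contiguous waste:", round(contiguous_wasted_fraction(list(lens.values()), 2048), 3))
```

The waste figures this toy produces depend entirely on the assumed sequence lengths, block size, and reservation size; they are not a reproduction of the percentages quoted above.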
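
The difference between static and continuous batching can likewise be sketched as a small scheduling simulation. This is an illustrative model under stated assumptions, not vLLM's scheduler: the function names are hypothetical, each request is reduced to a positive number of decode steps, every step costs the same, and prefill is ignored. It counts how many steps a server with a fixed number of batch slots needs under each policy.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Continuous batching: a freed slot is refilled before the next step."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each in-flight sequence
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every running sequence emits one token; finished ones leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


if __name__ == "__main__":
    # Hypothetical workload: two long requests mixed with short ones.
    lengths = [8, 2, 2, 2, 8, 2, 2, 2]
    print("static:", static_batching_steps(lengths, batch_size=4))
    print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

In this toy workload the short requests no longer wait for the long one in their batch, so the same work finishes in fewer steps. Real throughput gains depend on the request mix and hardware, and this model does not attempt to reproduce the 24x figure cited above.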
