Stripe Cuts Inference Costs with vLLM
Stripe achieved a 73% reduction in inference costs and a throughput increase of up to 24x by deploying the open-source library vLLM in production. The implementation demonstrates how high-throughput, low-latency architectures are becoming critical for scaling agentic AI workloads cost-effectively.
- The core innovation enabling vLLM's performance is PagedAttention, a memory management algorithm inspired by virtual memory in operating systems. This technique partitions the key-value (KV) cache into fixed-size blocks, which minimizes memory waste to under 4% compared to the 60-80% fragmentation seen in traditional systems.
- vLLM originated as an open-source project from the Sky Computing Lab at UC Berkeley and has grown through community contributions from both academia and industry. It integrates with other high-performance components like FlashAttention and supports a wide range of Hugging Face models, including Mixture-of-Experts (MoE) and multi-modal LLMs.
- The library is designed for high-throughput, multi-user serving, which contrasts with engines like llama.cpp, optimized for single-stream efficiency on a wider range of hardware. Benchmarks show vLLM's throughput scales significantly with concurrent user loads, while llama.cpp's remains flat, making vLLM better suited for high-traffic applications.
- Anyscale, a company founded by the creators of the open-source Ray framework, provides a managed platform for scaling vLLM deployments. This offers a production-ready infrastructure layer, handling orchestration and scaling for LLM inference workloads.
- Stripe's implementation of vLLM supports a broader strategy around "agentic workflows," where AI agents perform complex tasks. The company recently released the Stripe Agent Toolkit, a library that allows agents to securely use financial services like issuing virtual cards and processing payments via API calls.
- The company had to build a dedicated "Agent Service" because traditional ML inference infrastructure could not handle the unique characteristics of agent workloads. Agentic systems are network I/O bound with long timeouts and non-deterministic execution, differing from the compute-bound, deterministic nature of traditional ML models.
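
To make the PagedAttention point above more concrete, here is a minimal sketch of block-based KV cache allocation. It is a toy model, not vLLM's actual implementation: the class `PagedKVCache`, the helper `contiguous_wasted_fraction`, the block size of 16 tokens, the 2048-token reservation, and the sequence lengths are all assumptions chosen for illustration, and the resulting percentages are not measurements.

```python
class PagedKVCache:
    """Toy block allocator (hypothetical): each sequence owns a block table
    that maps its logical blocks to whichever physical blocks were free."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size               # tokens per block (assumed)
        self.free_blocks = list(range(num_blocks)) # pool of physical block ids
        self.block_tables = {}                     # seq_id -> physical block ids
        self.seq_lens = {}                         # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token, taking a new block only when
        the sequence's last block is full (or it has no block yet)."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_fraction(self):
        """Share of allocated token slots that hold no token."""
        allocated = sum(len(t) for t in self.block_tables.values()) * self.block_size
        used = sum(self.seq_lens.values())
        return 0.0 if allocated == 0 else (allocated - used) / allocated


def contiguous_wasted_fraction(seq_lens, max_len):
    """Waste when every sequence reserves max_len contiguous slots up front."""
    reserved = max_len * len(seq_lens)
    return (reserved - sum(seq_lens)) / reserved


# Illustrative workload: three sequences of different lengths.
lengths = {0: 100, 1: 37, 2: 250}
cache = PagedKVCache(num_blocks=64)
for seq_id, length in lengths.items():
    for _ in range(length):
        cache.append_token(seq_id)

print(f"paged waste:      {cache.wasted_fraction():.1%}")
print(f"contiguous waste: {contiguous_wasted_fraction(list(lengths.values()), 2048):.1%}")
```

In this sketch the only unused slots are in the final, partially filled block of each sequence, whereas the contiguous scheme pays for the full reservation regardless of how long each sequence actually gets. Freed blocks go back to a shared pool, so they can be handed to any other sequence without needing to be adjacent in memory.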