Stripe Cuts Inference Costs with vLLM

Stripe achieved a 73% reduction in inference costs and a throughput increase of up to 24x by deploying the open-source library vLLM in production. The implementation demonstrates how high-throughput, low-latency architectures are becoming critical for scaling agentic AI workloads cost-effectively.

- The core innovation enabling vLLM's performance is PagedAttention, a memory management algorithm inspired by virtual memory in operating systems. This technique partitions the key-value (KV) cache into fixed-size blocks, which minimizes memory waste to under 4% compared to the 60-80% fragmentation seen in traditional systems.
- vLLM originated as an open-source project from the Sky Computing Lab at UC Berkeley and has grown through community contributions from both academia and industry. It integrates with other high-performance components like FlashAttention and supports a wide range of Hugging Face models, including Mixture-of-Experts (MoE) and multi-modal LLMs.
- The library is designed for high-throughput, multi-user serving, which contrasts with engines like llama.cpp, optimized for single-stream efficiency on a wider range of hardware. Benchmarks show vLLM's throughput scales significantly with concurrent user loads, while llama.cpp's remains flat, making vLLM better suited for high-traffic applications.
- Anyscale, a company founded by the creators of the open-source Ray framework, provides a managed platform for scaling vLLM deployments. This offers a production-ready infrastructure layer, handling orchestration and scaling for LLM inference workloads.
- Stripe's implementation of vLLM supports a broader strategy around "agentic workflows," where AI agents perform complex tasks. The company recently released the Stripe Agent Toolkit, a library that allows agents to securely use financial services like issuing virtual cards and processing payments via API calls.
- The company had to build a dedicated "Agent Service" because traditional ML inference infrastructure could not handle the unique characteristics of agent workloads. Agentic systems are network I/O bound with long timeouts and non-deterministic execution, differing from the compute-bound, deterministic nature of traditional ML models.
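
To make the PagedAttention point above concrete, here is a minimal sketch of block-based KV-cache allocation. It is a toy model for illustration only, not vLLM's actual implementation: the class `PagedKVCache`, the helper `contiguous_wasted_fraction`, and the block size of 16 tokens are all assumptions introduced here. It tracks only which blocks are reserved, not the cached tensors themselves.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value, an assumption)


class PagedKVCache:
    """Toy allocator mapping each sequence to a list of fixed-size blocks."""

    def __init__(self, num_blocks, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored so far

    def append_token(self, seq_id):
        """Reserve a new block only when the sequence's last block is full."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_fraction(self):
        """Fraction of reserved token slots that hold no token."""
        reserved = sum(len(t) for t in self.block_tables.values()) * self.block_size
        used = sum(self.lengths.values())
        return 0.0 if reserved == 0 else 1 - used / reserved


def contiguous_wasted_fraction(lengths, max_len):
    """Waste when every sequence pre-reserves max_len contiguous slots."""
    reserved = len(lengths) * max_len
    return 1 - sum(lengths) / reserved
```

In this model, waste under paging is bounded by the unfilled tail of each sequence's last block, whereas pre-reserving a maximum-length contiguous region per request leaves most slots empty whenever outputs are short. That difference is the intuition behind the gap between the two waste figures cited above; the exact percentages depend on the workload and are not reproduced by this sketch.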
