Stripe Cuts Inference Costs with vLLM
What happened
Stripe achieved a 73% reduction in inference costs and a throughput increase of up to 24x by deploying the open-source library vLLM in production. The implementation demonstrates how high-throughput, low-latency architectures are becoming critical for scaling agentic AI workloads cost-effectively.
Why it matters
- The core innovation enabling vLLM's performance is PagedAttention, a memory management algorithm inspired by virtual memory in operating systems. This technique partitions the key-value (KV) cache into fixed-size blocks, which minimizes memory waste to under 4% compared to the 60-80% fragmentation seen in traditional systems.
- vLLM originated as an open-source project from the Sky Computing Lab at UC Berkeley and has grown through community contributions from both academia and industry. It integrates with other high-performance components like FlashAttention and supports a wide range of Hugging Face models, including Mixture-of-Experts (MoE) and multi-modal LLMs.
- The library is designed for high-throughput, multi-user serving, which contrasts with engines like llama.cpp, optimized for single-stream efficiency on a wider range of hardware. Benchmarks show vLLM's throughput scales significantly with concurrent user loads, while llama.cpp's remains flat, making vLLM better suited for high-traffic applications.
- Anyscale, a company founded by the creators of the open-source Ray framework, provides a managed platform for scaling vLLM deployments. This offers a production-ready infrastructure layer, handling orchestration and scaling for LLM inference workloads.
- Stripe's implementation of vLLM supports a broader strategy around "agentic workflows," where AI agents perform complex tasks. The company recently released the Stripe Agent Toolkit, a library that allows agents to securely use financial services like issuing virtual cards and processing payments via API calls.
- The company had to build a dedicated "Agent Service" because traditional ML inference infrastructure could not handle the unique characteristics of agent workloads. Agentic systems are network I/O bound with long timeouts and non-deterministic execution, differing from the compute-bound, deterministic nature of traditional ML models.
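To make the fixed-size-block idea behind PagedAttention more concrete, here is a minimal Python sketch of a block-based KV cache allocator. It is a toy model, not vLLM's actual implementation: the class `PagedKVCache`, the helper `contiguous_waste`, the block size of 16 tokens, and the 2048-token reservation are all assumptions chosen for illustration. The point it tries to show is that waste is limited to the unfilled tail of each sequence's last block, whereas reserving a maximum-length contiguous region per sequence leaves most of the reservation unused.

```python
BLOCK_SIZE = 16  # tokens per KV block; illustrative value, not taken from the article


class PagedKVCache:
    """Toy allocator (hypothetical): hands out fixed-size KV blocks on demand."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # ids of unused physical blocks
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_tokens(self, seq_id, n):
        """Store n more tokens, taking a new block only when the last one is full."""
        for _ in range(n):
            length = self.seq_lens.get(seq_id, 0)
            if length % BLOCK_SIZE == 0:  # no block yet, or last block is full
                if not self.free_blocks:
                    raise MemoryError("no free KV blocks")
                block = self.free_blocks.pop()
                self.block_tables.setdefault(seq_id, []).append(block)
            self.seq_lens[seq_id] = length + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_fraction(self):
        """Share of allocated token slots that hold no token."""
        allocated = sum(len(b) for b in self.block_tables.values()) * BLOCK_SIZE
        used = sum(self.seq_lens.values())
        return 0.0 if allocated == 0 else (allocated - used) / allocated


def contiguous_waste(seq_lens, max_len):
    """Waste if every sequence reserves max_len contiguous slots up front."""
    reserved = max_len * len(seq_lens)
    return (reserved - sum(seq_lens)) / reserved


if __name__ == "__main__":
    lengths = {"a": 100, "b": 37, "c": 250}  # made-up sequence lengths
    cache = PagedKVCache(num_blocks=64)
    for seq_id, n in lengths.items():
        cache.append_tokens(seq_id, n)
    print(f"paged waste:      {cache.wasted_fraction():.1%}")
    print(f"contiguous waste: {contiguous_waste(list(lengths.values()), 2048):.1%}")
```

The exact percentages depend entirely on the made-up sequence lengths and block size, so they should not be read as a reproduction of the under-4% and 60-80% figures quoted above; only the direction of the comparison carries over.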
Key numbers
- 73%: reduction in inference costs Stripe achieved by deploying vLLM in production.
- Up to 24x: throughput increase from the same deployment.
- Under 4%: KV cache memory waste with PagedAttention's fixed-size blocks, compared to the 60-80% fragmentation seen in traditional systems.
What happens next
- The company had to build a dedicated "Agent Service" because traditional ML inference infrastructure could not handle the unique characteristics of agent workloads.
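One way to picture why agent workloads fit poorly on compute-oriented inference infrastructure is a small asyncio sketch. It is not Stripe's Agent Service, whose design the article does not describe; `call_tool`, `run_agent`, and `run_agents` are hypothetical names, and `asyncio.sleep` stands in for a slow network call. The sketch assumes each agent makes a variable number of tool calls and runs under an overall timeout, which mirrors the network I/O bound, long-timeout, non-deterministic profile described above.

```python
import asyncio


async def call_tool(name, delay):
    """Stand-in for a slow network call to an external tool (hypothetical)."""
    await asyncio.sleep(delay)  # the task waits on I/O; no compute happens here
    return f"{name}: done"


async def run_agent(agent_id, step_delays, timeout):
    """Run one agent's tool calls in sequence under an overall timeout."""

    async def steps():
        results = []
        for i, delay in enumerate(step_delays):
            results.append(await call_tool(f"agent{agent_id}-step{i}", delay))
        return results

    try:
        results = await asyncio.wait_for(steps(), timeout=timeout)
        return f"agent {agent_id} finished {len(results)} steps"
    except asyncio.TimeoutError:
        return f"agent {agent_id} timed out"


async def run_agents(workloads, timeout):
    """Run many agents concurrently; results come back in input order."""
    tasks = [run_agent(i, delays, timeout) for i, delays in enumerate(workloads)]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    # Each inner list is one agent's sequence of simulated call latencies (seconds).
    workloads = [[0.01, 0.01], [0.01], [1.0]]
    for line in asyncio.run(run_agents(workloads, timeout=0.2)):
        print(line)
```

Because the agents spend nearly all their time waiting, one process can interleave many of them, and the per-agent timeout bounds how long a stalled call can hold resources. A batch-oriented, compute-bound model server is typically sized around very different assumptions.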
Quick answers
What did Stripe achieve with vLLM?
Stripe achieved a 73% reduction in inference costs and a throughput increase of up to 24x by deploying the open-source library vLLM in production. The implementation demonstrates how high-throughput, low-latency architectures are becoming critical for scaling agentic AI workloads cost-effectively.
Why does Stripe's vLLM deployment matter?
The core innovation enabling vLLM's performance is PagedAttention, a memory management algorithm inspired by virtual memory in operating systems. This technique partitions the key-value (KV) cache into fixed-size blocks, which minimizes memory waste to under 4% compared to the 60-80% fragmentation seen in traditional systems. vLLM originated as an open-source project from the Sky Computing Lab at UC Berkeley and has grown through community contributions from both academia and industry. It integrates with other high-performance components like FlashAttention and supports a wide range of Hugging Face models, including Mixture-of-Experts (MoE) and multi-modal LLMs.
The library is designed for high-throughput, multi-user serving, which contrasts with engines like llama.cpp, optimized for single-stream efficiency on a wider range of hardware. Benchmarks show vLLM's throughput scales significantly with concurrent user loads, while llama.cpp's remains flat, making vLLM better suited for high-traffic applications. Anyscale, a company founded by the creators of the open-source Ray framework, provides a managed platform for scaling vLLM deployments. This offers a production-ready infrastructure layer, handling orchestration and scaling for LLM inference workloads.
Stripe's implementation of vLLM supports a broader strategy around "agentic workflows," where AI agents perform complex tasks. The company recently released the Stripe Agent Toolkit, a library that allows agents to securely use financial services like issuing virtual cards and processing payments via API calls. The company had to build a dedicated "Agent Service" because traditional ML inference infrastructure could not handle the unique characteristics of agent workloads. Agentic systems are network I/O bound with long timeouts and non-deterministic execution, differing from the compute-bound, deterministic nature of traditional ML models.