Case Study on High-Throughput RAG

A new case study details the architecture of a Retrieval-Augmented Generation (RAG) system serving two million queries per day with sub-200ms latency. The production pipeline runs on bare-metal NVIDIA H100 clusters and uses vLLM as its serving engine. Key optimizations include dynamic batching, GPU partitioning, and request routing to maximize hardware utilization and cost efficiency.
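To put the headline numbers in perspective, the short calculation below converts the daily volume into an average request rate and, using Little's law, an average number of in-flight requests. It is a back-of-the-envelope sketch: it assumes traffic is spread evenly over the day and treats 200 ms as the time each request spends in the system, neither of which the case study states. Real traffic is usually bursty, so peak load would be higher.

```python
# Figures taken from the summary above.
QUERIES_PER_DAY = 2_000_000
LATENCY_S = 0.200  # "sub-200ms", used here as an upper bound

SECONDS_PER_DAY = 24 * 60 * 60

# Average arrival rate, assuming (unrealistically) uniform traffic.
avg_qps = QUERIES_PER_DAY / SECONDS_PER_DAY

# Little's law: mean requests in flight = arrival rate * mean time in system.
avg_in_flight = avg_qps * LATENCY_S

print(f"{avg_qps:.1f} queries/s on average, ~{avg_in_flight:.1f} in flight")
```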

- The vLLM serving engine's core innovation is PagedAttention, which manages the KV cache the way an operating system manages virtual memory: the cache is split into fixed-size blocks that need not be contiguous in GPU memory. This sharply reduces memory fragmentation, bringing waste below 4%, compared with 60-80% in traditional systems.
- Dynamic batching, also known as continuous batching, schedules work token by token: new requests join the running batch as soon as earlier ones finish. This maximizes GPU utilization by eliminating the idle time a static batch spends waiting for its longest sequence to complete.
- While vLLM offers flexibility and easy integration with Hugging Face models, NVIDIA's TensorRT-LLM can achieve higher throughput and lower latency on NVIDIA GPUs through more aggressive optimizations such as kernel fusion and CUDA graphs. For instance, under a 1-second time-to-first-token constraint, TensorRT-LLM handled about 6 requests per second compared with about 5 for vLLM, a reported 16.4% throughput advantage.
- The NVIDIA H100 GPU, built on the Hopper architecture, delivers up to 9 times faster AI training than its predecessor, the A100. For inference, a single H100 can reach roughly 3,500-4,000 tokens per second on a Llama 70B model using vLLM.
- On-premises deployment of H100 clusters requires significant upfront investment: a single GPU costs $25,000-$40,000, and an 8-GPU server runs $250,000-$400,000. For high-utilization workloads (over 60%), however, owning the hardware can be 40-60% cheaper over three years than cloud rental, which costs approximately $1.85 to $4.50 per GPU per hour.
- Scaling RAG systems introduces challenges beyond inference, including retrieval latency and data quality. As document corpora grow, retrieval precision can drop, and without continuous data updates and versioned embeddings, the system may return outdated or irrelevant information.
- Enterprise search competitors such as Glean and Hebbia are also building on RAG. Glean creates a personalized "enterprise knowledge graph" that maps relationships between content, people, and activities to deliver role-specific results. Hebbia focuses on deep analysis of unstructured data and has gained traction in the finance and legal sectors.
- A key challenge in enterprise RAG is respecting data governance and permissions. Systems must integrate with the existing access control lists of sources such as Google Drive or Slack so that users never see information they are not authorized to access.
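The PagedAttention idea described above can be illustrated with a toy block allocator. This is a simplified sketch, not vLLM's implementation: the class `ToyBlockAllocator`, the block size of 16 tokens, the 2,048-token preallocation limit, and the three sequence lengths are all illustrative assumptions. It only shows why allocating small blocks on demand wastes at most one partially filled block per sequence, whereas reserving a maximum-length contiguous region per sequence wastes most of the reservation.

```python
BLOCK_SIZE = 16   # tokens per KV block (illustrative assumption)
MAX_LEN = 2048    # per-sequence reservation in the contiguous baseline (assumption)


class ToyBlockAllocator:
    """Hands out fixed-size KV-cache blocks from a shared free list."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % BLOCK_SIZE == 0:  # no block yet, or the last block is full
            if not self.free:
                raise MemoryError("out of KV blocks")
            table.append(self.free.pop())  # any free block will do
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        # Finished sequences return their blocks for immediate reuse.
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def waste_fraction(self):
        allocated = sum(len(t) for t in self.block_tables.values()) * BLOCK_SIZE
        used = sum(self.lengths.values())
        return 0.0 if allocated == 0 else 1 - used / allocated


# Hypothetical workload: three sequences of very different lengths.
seq_lengths = {"a": 37, "b": 500, "c": 120}

alloc = ToyBlockAllocator(num_blocks=64)
for seq_id, length in seq_lengths.items():
    for _ in range(length):
        alloc.append_token(seq_id)

paged_waste = alloc.waste_fraction()
static_waste = 1 - sum(seq_lengths.values()) / (len(seq_lengths) * MAX_LEN)

print(f"paged waste:      {paged_waste:.1%}")
print(f"contiguous waste: {static_waste:.1%}")
```

The exact percentages depend entirely on the assumed lengths and block size, so they should not be read as a reproduction of the figures quoted above.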
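The benefit of continuous batching over static batching can be seen in a small simulation that counts decode iterations. This is a minimal sketch under simplifying assumptions: each iteration produces one token for every active sequence, prefill cost is ignored, all requests are already waiting, and the output lengths and batch size are made up for illustration. The function names are hypothetical and do not correspond to any vLLM API.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence has finished."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A finished sequence's slot is refilled before the next iteration."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active sequence
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode iteration: one token per active sequence
        # Drop sequences that just produced their last token.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical output lengths (tokens): a few long requests among short ones.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
BATCH_SIZE = 4

static_steps = static_batching_steps(lengths, BATCH_SIZE)
continuous_steps = continuous_batching_steps(lengths, BATCH_SIZE)

print(f"static:     {static_steps} iterations")
print(f"continuous: {continuous_steps} iterations")
```

With these assumed lengths, static batching needs 200 iterations because each batch is held open by its 100-token request, while continuous batching finishes in 110 because the short requests vacate their slots early and the second long request starts after only 10 iterations.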
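The on-premises versus cloud comparison above comes down to simple arithmetic, sketched below. The server price and hourly rate are picked from within the ranges quoted in the briefing; the annual operating cost (power, cooling, hosting, staff) is a purely hypothetical placeholder, and the model assumes cloud GPUs are billed only for the hours actually used. The result is therefore an illustration of the method, not a reproduction of the 40-60% savings figure.

```python
HOURS_3Y = 3 * 365 * 24  # hours in a three-year horizon

GPUS = 8
SERVER_PRICE = 300_000      # USD, within the quoted $250,000-$400,000 range
CLOUD_RATE = 3.00           # USD per GPU-hour, within the quoted $1.85-$4.50 range
ANNUAL_OPEX = 40_000        # USD per year, hypothetical (not from the source)


def cloud_cost(utilization, gpus=GPUS, rate=CLOUD_RATE, hours=HOURS_3Y):
    """Three-year rental cost when paying only for utilized GPU-hours."""
    return gpus * rate * hours * utilization


def on_prem_cost(server_price=SERVER_PRICE, annual_opex=ANNUAL_OPEX, years=3):
    """Three-year ownership cost: purchase price plus operating costs."""
    return server_price + annual_opex * years


# Utilization at which renting and owning cost the same over three years.
break_even = on_prem_cost() / cloud_cost(utilization=1.0)

for u in (0.3, 0.6, 0.9):
    print(f"utilization {u:.0%}: cloud ${cloud_cost(u):,.0f} "
          f"vs on-prem ${on_prem_cost():,.0f}")
print(f"break-even utilization: {break_even:.0%}")
```

Under these particular assumptions ownership only pays off at roughly two-thirds utilization or higher; a higher cloud rate or lower operating cost moves the break-even point down.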
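The permissions requirement in the last point can be sketched as a filter applied to retrieved chunks before they reach the model. This is a minimal illustration, not how Glean, Hebbia, or any specific product works: the `Chunk` type, the group names, and the scores are hypothetical, and a production system would sync access control lists from the source systems rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    score: float               # retrieval similarity score (higher is better)
    allowed_groups: frozenset  # groups permitted to read the source document


def retrieve_for_user(candidates, user_groups, top_k=2):
    """Drop chunks the user may not read, then keep the top_k by score.

    Filtering happens before truncation so that unauthorized chunks
    cannot crowd authorized ones out of the final context.
    """
    visible = [c for c in candidates if c.allowed_groups & user_groups]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]


# Hypothetical candidates returned by a vector search.
candidates = [
    Chunk("Q3 board deck", 0.95, frozenset({"executives"})),
    Chunk("Engineering onboarding guide", 0.80, frozenset({"engineering", "hr"})),
    Chunk("Public product FAQ", 0.70, frozenset({"everyone"})),
    Chunk("Salary bands", 0.90, frozenset({"hr"})),
]

engineer_groups = frozenset({"engineering", "everyone"})
results = retrieve_for_user(candidates, engineer_groups)

for chunk in results:
    print(f"{chunk.score:.2f}  {chunk.text}")
```

Applying the filter inside the retrieval step, rather than after generation, means restricted text never enters the prompt in the first place.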

Shared from Scout - Be the smartest in the room.