KV cache identified as VRAM bottleneck

For large language models with long context windows, the VRAM footprint is increasingly dominated by the Key-Value (KV) cache, rather than just model weights, especially under high concurrency. This makes VRAM a significant gating factor for performance and cost. Engineers are using strategies like quantization, intelligent sharding, and cache offloading to manage these hardware constraints.

- The vLLM project introduced PagedAttention, an algorithm inspired by virtual memory and paging in operating systems, to manage the KV cache. This technique divides the KV cache into blocks that can be stored in non-contiguous memory, mitigating fragmentation and increasing memory utilization to nearly 96%.
- Architectural changes to the attention mechanism itself, such as Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), are used to reduce the size of the KV cache. In GQA, each group of query heads shares a single key and value head, which is a compromise between the single key-value head of MQA and the one-per-query-head layout of traditional multi-head attention. For example, the Llama-2-70B model uses 64 query heads but only 8 key-value heads, reducing the cache size by a factor of 8 relative to full multi-head attention.
- For a model like Llama 3 70B, handling a 1 million token context in float16 would require approximately 330 GB of VRAM just for the KV cache. This memory requirement grows linearly with the sequence length and batch size, making it a primary constraint.
- FlashAttention-2, a memory-aware attention algorithm, can deliver up to a 2x speedup over the original FlashAttention by optimizing work partitioning on the GPU to reduce shared memory reads and writes. This results in training speeds of up to 225 TFLOP/s on A100 GPUs.
- Quantizing the KV cache from 16-bit precision to lower-precision formats like INT8 or INT4 can halve or quarter its memory footprint, respectively. More advanced techniques like 2-bit quantization have been shown to reduce peak memory usage by 2.6x and improve throughput by up to 3.47x with less than a 1% accuracy drop.
- CPU or NVMe offloading is a technique where parts of the KV cache are moved from GPU VRAM to CPU RAM or solid-state drives. This is particularly effective for managing inactive or less frequently accessed cache data in high-concurrency scenarios, allowing the GPU to serve more users.
- Systems like vLLM with PagedAttention can eliminate up to 80% of memory waste caused by internal and external fragmentation in traditional KV cache allocation. This improved efficiency allows for larger batch sizes and, consequently, higher throughput.
- While offloading the KV cache can improve memory efficiency, the transfer speed between the GPU and the offloading target (like CPU RAM) is critical. If the data transfer overhead is higher than the cost of recomputing the cache, the benefits are negated, making this a key consideration for latency-sensitive applications.
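
The sizing figures above follow from simple arithmetic. As a rough sketch, the KV cache stores one key vector and one value vector per layer, per key-value head, per token, so its size is 2 x layers x KV heads x head dimension x tokens x batch size x bytes per element. The Llama 3 70B values used below (80 layers, 8 key-value heads, head dimension 128) are assumptions taken from the publicly released model configuration rather than from this briefing, and `AttentionConfig` and `kv_cache_bytes` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionConfig:
    """Minimal description of the attention layout needed to size a KV cache."""
    num_layers: int
    num_kv_heads: int
    head_dim: int


def kv_cache_bytes(cfg: AttentionConfig, seq_len: int,
                   batch_size: int = 1, bytes_per_element: float = 2.0) -> float:
    """Estimate KV cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_element is 2.0 for float16, 1.0 for INT8, 0.5 for INT4.
    Quantization metadata (scales, zero points) is ignored in this sketch.
    """
    return (2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim
            * seq_len * batch_size * bytes_per_element)


# Assumed values for Llama 3 70B (not stated in the briefing itself).
LLAMA3_70B_GQA = AttentionConfig(num_layers=80, num_kv_heads=8, head_dim=128)
# Hypothetical variant with one KV head per query head (full multi-head attention).
LLAMA3_70B_MHA = AttentionConfig(num_layers=80, num_kv_heads=64, head_dim=128)

GB = 1e9
tokens = 1_000_000

fp16_gqa = kv_cache_bytes(LLAMA3_70B_GQA, tokens) / GB
fp16_mha = kv_cache_bytes(LLAMA3_70B_MHA, tokens) / GB
int8_gqa = kv_cache_bytes(LLAMA3_70B_GQA, tokens, bytes_per_element=1.0) / GB
int4_gqa = kv_cache_bytes(LLAMA3_70B_GQA, tokens, bytes_per_element=0.5) / GB

print(f"float16, GQA (8 KV heads):  {fp16_gqa:.2f} GB")
print(f"float16, MHA (64 KV heads): {fp16_mha:.2f} GB")
print(f"INT8, GQA:                  {int8_gqa:.2f} GB")
print(f"INT4, GQA:                  {int4_gqa:.2f} GB")
```

Under these assumptions the cache costs 327,680 bytes (320 KiB) per token in float16, or about 328 GB at one million tokens, which is consistent with the roughly 330 GB figure cited above. The same formula shows the 8x saving from using 8 rather than 64 key-value heads and the 2x and 4x savings from INT8 and INT4 storage.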

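The block-based allocation behind PagedAttention can be illustrated with a toy allocator. This is a minimal sketch of the general idea (fixed-size blocks, a free list, and a per-sequence block table that maps logical positions to physical blocks), not vLLM's actual implementation or API. The class `PagedKVAllocator`, its methods, and the block size are all hypothetical.

```python
class PagedKVAllocator:
    """Toy paged KV cache allocator: hands out fixed-size blocks on demand."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        # Stack of free physical block ids; pop() returns the lowest id first.
        self.free_blocks = list(range(num_blocks - 1, -1, -1))
        # Per-sequence block table: logical block index -> physical block id.
        self.block_tables: dict[str, list[int]] = {}
        self.seq_lens: dict[str, int] = {}

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve a slot for one new token; return (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when the sequence has no block yet
        # or its last block is completely full.
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(reversed(self.block_tables.pop(seq_id, [])))
        self.seq_lens.pop(seq_id, None)


# Two interleaved sequences end up with non-contiguous physical blocks.
allocator = PagedKVAllocator(num_blocks=4, block_size=4)
allocator.append_token("a")          # "a" takes physical block 0
allocator.append_token("b")          # "b" takes physical block 1
for _ in range(4):
    allocator.append_token("a")      # the 5th token of "a" spills into block 2
print(allocator.block_tables)        # {'a': [0, 2], 'b': [1]}
```

Because blocks are allocated only when a sequence actually needs them, each sequence wastes at most one partially filled block, and any free block can serve any sequence. That is the property that lets a paged design avoid reserving a contiguous, maximum-length region per request.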