KV Cache Dominates VRAM in Large Models
The memory economics of deploying large language models have shifted, with the KV cache now dominating VRAM usage, especially for models with long context windows. An analysis of the Kimi K2.5 model highlights that batch size, context length, and the KV cache are the primary cost drivers. This makes aggressive quantization and careful context management essential for cost-effective deployment on inference servers.
- The KV cache size in bytes is: `batch_size * sequence_length * num_layers * num_kv_heads * head_dimension * 2 * data_type_size`. Here `num_kv_heads` is the number of key/value heads (equal to the number of attention heads under standard multi-head attention), and the factor of 2 covers storing both keys and values. The cache grows linearly with sequence length and batch size, which is why it becomes the bottleneck in long-context scenarios.
- Architectural innovations like Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) directly reduce the KV cache size. MQA uses a single key and value head across all query heads, while GQA strikes a balance by having groups of query heads share a smaller number of key/value heads. Models like Llama 2 70B and Mistral 7B have adopted GQA to improve efficiency.
- Quantizing the KV cache to lower-precision data types like INT8, FP8, or even INT4 is a common optimization that can cut the memory footprint to a half or a quarter of its FP16 size. This can cost some generation speed because of quantization/dequantization overhead, but it significantly increases the effective context length a model can handle in a given amount of memory. Research has shown that, with techniques such as treating outlier values separately, 3-bit quantization is achievable with less than 0.1 perplexity degradation.
- Serving engines like vLLM employ a memory management technique called PagedAttention, inspired by virtual memory and paging in operating systems. It divides the KV cache into fixed-size blocks that can be stored non-contiguously, which reduces memory waste from fragmentation to under 4% and enables more efficient memory sharing between requests.
- Inference servers and libraries like NVIDIA's TensorRT-LLM offer fine-grained control over the KV cache. This includes prioritized eviction policies that go beyond simple LRU, letting users specify which parts of the cache are most important to retain for a given workload. TensorRT-LLM also supports offloading less frequently used cache blocks from GPU HBM to CPU RAM.
- Prefix caching, or cache reuse, is a technique where the KV cache for a shared prefix (like a system prompt or a common document context) is computed once and then reused across multiple subsequent requests. This significantly reduces redundant computation and lowers the time-to-first-token (TTFT) latency.
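
To make the sizing formula from the list above concrete, here is a minimal Python sketch that evaluates it for a few configurations. The model shape (32 layers, 32 query heads, head dimension 128) and the helper names `kv_cache_bytes` and `gib` are hypothetical, chosen only for illustration and not taken from any particular model. `num_kv_heads` is the number of key/value heads, which equals the number of attention heads under standard multi-head attention. The quantized figures ignore the small overhead of quantization scales and zero-points, so treat them as lower bounds.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads,
                   head_dim, bytes_per_elem):
    """KV cache size in bytes. The factor of 2 covers keys and values."""
    return (batch_size * seq_len * num_layers * num_kv_heads
            * head_dim * 2 * bytes_per_elem)


def gib(num_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


# Hypothetical model shape, for illustration only.
NUM_LAYERS = 32
NUM_QUERY_HEADS = 32
HEAD_DIM = 128
SEQ_LEN = 4096
BATCH = 1

# (label, number of KV heads, bytes per element)
configs = [
    ("MHA, FP16", NUM_QUERY_HEADS, 2),    # one KV head per query head
    ("GQA (8 KV heads), FP16", 8, 2),     # 4 query heads share each KV head
    ("MQA (1 KV head), FP16", 1, 2),      # all query heads share one KV head
    ("MHA, INT8", NUM_QUERY_HEADS, 1),    # half the FP16 footprint
    ("MHA, INT4", NUM_QUERY_HEADS, 0.5),  # a quarter of the FP16 footprint
]

sizes = {
    label: kv_cache_bytes(BATCH, SEQ_LEN, NUM_LAYERS, kv_heads,
                          HEAD_DIM, bytes_per_elem)
    for label, kv_heads, bytes_per_elem in configs
}

for label, size in sizes.items():
    print(f"{label:<26} {gib(size):.4f} GiB")
```

Under these assumptions a single 4096-token sequence needs 2 GiB of cache with FP16 multi-head attention. Moving to 8 KV heads or to INT4 each brings that down to 0.5 GiB, and because every term in the formula is multiplicative, doubling the batch size or the context length doubles each figure.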