Disaggregated prefill/decode tested on Blackwell GPUs

Engineers are exploring disaggregated architectures to improve inference performance for large language models. One approach being trialed on NVIDIA's Blackwell GPUs involves splitting the prefill and decode stages of inference. This technique, using tools like LMCache, aims to boost concurrency and minimize idle GPU time during serving.

- The prefill stage is compute-bound, processing input tokens in parallel, while the decode stage is memory-bound, generating tokens sequentially. Disaggregating them allows for specialized hardware: GPUs with high compute power for prefill and those with high memory bandwidth for decode, preventing resource contention.
- A key trade-off with this architecture is a potential increase in Time to First Token (TTFT) due to the network latency of transferring the KV cache from prefill to decode workers, which can take tens to hundreds of milliseconds. For shorter prompts, a local prefill may be more efficient.
- LMCache is an open-source KV caching layer that enables this architecture by efficiently extracting and transferring KV caches between engines. In relevant workloads like multi-round Q&A, combining LMCache with an engine like vLLM can increase throughput by up to 15x.
- This approach allows for independent scaling of resources based on workload. For applications with long prompts and short responses (e.g., summarization), prefill workers can be scaled up, while for conversational AI with short prompts and long responses, decode workers can be increased.
- NVIDIA's Blackwell architecture features a second-generation Transformer Engine with new 4-bit floating point (FP4) AI capabilities. This can double performance for compute-bound tasks like the prefill phase compared to previous generations while maintaining high accuracy.
- The architecture also improves performance in the memory-bound decode phase. The full-rack GB200 NVL72 system, which links 72 Blackwell GPUs, is designed to act as a single massive GPU, promising up to 30x faster real-time inference on trillion-parameter models.
- Disaggregation allows for different parallelism strategies to be applied to each stage. For example, tensor parallelism can be used to reduce latency in the prefill stage, while pipeline parallelism can be used to increase throughput for the decode stage.
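
The TTFT trade-off described above can be made concrete with a rough back-of-the-envelope model. The sketch below is illustrative only and is not LMCache's or vLLM's API: the model dimensions, link bandwidth, fixed latency, and per-token prefill costs are all hypothetical numbers chosen for the example, and the function names (`kv_cache_bytes`, `transfer_ms`, `choose_prefill_site`) are made up for this illustration. It estimates the KV cache size for a prompt, the time to ship it to a decode worker, and whether a remote (disaggregated) prefill would likely beat a local one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model dimensions, used only for this illustration."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_element: int = 2  # assumes an FP16/BF16 KV cache


def kv_cache_bytes(shape: ModelShape, prompt_tokens: int) -> int:
    """KV cache size for one prompt: a key and a value vector per layer per token."""
    per_token = (2 * shape.num_layers * shape.num_kv_heads
                 * shape.head_dim * shape.bytes_per_element)
    return per_token * prompt_tokens


def transfer_ms(num_bytes: int, link_gbytes_per_s: float,
                fixed_latency_ms: float) -> float:
    """Estimated time to move the KV cache from a prefill to a decode worker."""
    return fixed_latency_ms + num_bytes / (link_gbytes_per_s * 1e9) * 1e3


def choose_prefill_site(shape: ModelShape, prompt_tokens: int,
                        local_ms_per_token: float, remote_ms_per_token: float,
                        link_gbytes_per_s: float, fixed_latency_ms: float) -> str:
    """Return "remote" if disaggregated prefill plus transfer is estimated to
    finish sooner than prefilling on the decode worker, otherwise "local"."""
    local_ms = prompt_tokens * local_ms_per_token
    remote_ms = prompt_tokens * remote_ms_per_token + transfer_ms(
        kv_cache_bytes(shape, prompt_tokens), link_gbytes_per_s, fixed_latency_ms)
    return "remote" if remote_ms < local_ms else "local"


if __name__ == "__main__":
    # All values below are assumptions, not measurements.
    shape = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128)
    for tokens in (100, 8000):
        site = choose_prefill_site(shape, tokens,
                                   local_ms_per_token=0.05,
                                   remote_ms_per_token=0.02,
                                   link_gbytes_per_s=10.0,
                                   fixed_latency_ms=20.0)
        print(tokens, kv_cache_bytes(shape, tokens), site)
```

Under these assumed numbers, the fixed transfer latency dominates for the short prompt, so it stays local, while the long prompt amortizes the transfer and is routed to a dedicated prefill worker. A real deployment would replace the constants with measured values.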

Shared from Scout - Be the smartest in the room.