Google Reduces LLM Latency

Google Research is using a technique called "speculative decoding" to cut the inference latency of its large language models. The method, in which a small model drafts likely next tokens for the large model to verify, has reduced user-perceived latency by over 40% in production workloads, even at large batch sizes.

- The core technique involves using a smaller, faster "draft" model to predict a sequence of upcoming tokens, which are then verified in a single, parallel batch by the larger, more accurate "target" model. This "draft-then-verify" approach ensures the final output is identical to what the larger model would have produced alone, guaranteeing no loss in quality.
- Google first introduced this method in a 2022 paper titled "Fast Inference from Transformers via Speculative Decoding" and now uses it in production for features like AI Overviews in Search. In original tests on translation and summarization, the technique yielded a 2x–3x speedup.
- The primary bottleneck in autoregressive generation is memory bandwidth; each new token requires a full forward pass, which leaves powerful GPUs underutilized while waiting for data. Speculative decoding improves GPU utilization by validating multiple tokens at once.
- The performance gain depends heavily on the draft model's acceptance rate and latency. Interestingly, a draft model's raw predictive power doesn't always correlate with the best performance; its speed is often a more critical factor.
- Several variations of this technique exist, including "self-speculative decoding" where the same model acts as both drafter and verifier, and methods like Medusa, which adds multiple prediction "heads" to the target model to generate drafts internally.
- The number of speculative tokens to generate is a key parameter that requires tuning; code-generation models can often benefit from a higher number of speculative tokens (6-8) compared to language models (3-4).
- This optimization is most effective for latency-sensitive applications like chatbots and at small batch sizes where compute resources are often idle. Its advantage diminishes at very large batch sizes.
- The concept is inspired by "speculative execution" in computer processors, where operations are performed ahead of time and discarded if the initial assumptions turn out to be wrong.
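
The draft-then-verify loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Google's implementation: `target_next` and `draft_next` are hypothetical stand-ins for real models, `gamma` is an assumed name for the number of drafted tokens, and only greedy decoding is covered (the sampling case needs an accept/reject rule that this sketch omits). In a real transformer the verification step is one batched forward pass; here it is simulated with a loop.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]

def target_next(ctx: List[int]) -> int:
    # Hypothetical "large" model: a deterministic rule over the context.
    return (ctx[-1] * 3 + len(ctx)) % 7

def draft_next(ctx: List[int]) -> int:
    # Hypothetical "small" model: same rule, but deliberately wrong
    # whenever the context length is a multiple of 4.
    tok = (ctx[-1] * 3 + len(ctx)) % 7
    return (tok + 1) % 7 if len(ctx) % 4 == 0 else tok

def greedy_decode(prompt: List[int], max_new_tokens: int,
                  target: NextToken) -> List[int]:
    # Baseline: one target call per generated token.
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target(out))
    return out

def speculative_decode(prompt: List[int], max_new_tokens: int, gamma: int,
                       draft: NextToken, target: NextToken
                       ) -> Tuple[List[int], int]:
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        # 1. Draft gamma tokens autoregressively with the cheap model.
        drafted: List[int] = []
        for _ in range(gamma):
            drafted.append(draft(out + drafted))
        # 2. Verify: the target predicts the token at every drafted
        #    position (conceptually a single parallel pass).
        target_passes += 1
        verified = [target(out + drafted[:i]) for i in range(gamma + 1)]
        # 3. Accept the longest prefix on which draft and target agree.
        n_accept = 0
        while n_accept < gamma and drafted[n_accept] == verified[n_accept]:
            n_accept += 1
        # 4. Keep the accepted tokens plus one token from the target:
        #    a correction at the first mismatch, or a bonus token if
        #    every drafted token was accepted.
        out.extend(drafted[:n_accept])
        out.append(verified[n_accept])
    return out[:len(prompt) + max_new_tokens], target_passes

if __name__ == "__main__":
    prompt = [1, 2]
    spec, passes = speculative_decode(prompt, 20, 4, draft_next, target_next)
    base = greedy_decode(prompt, 20, target_next)
    print("identical output:", spec == base)
    print("target passes:", passes, "vs baseline calls:", 20)
```

Because every kept token is either confirmed or produced by the target, the output matches plain greedy decoding with the target alone; the saving comes from needing fewer target passes than generated tokens.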
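
The trade-off between a drafter's acceptance rate and its speed can be made concrete with a back-of-envelope model. The sketch below follows the simplified analysis in the 2022 paper, which assumes each drafted token is accepted independently with the same probability. The names `alpha` (acceptance rate), `gamma` (drafted tokens per step) and `cost_ratio` (draft step time divided by target step time) are labels chosen here for illustration, and the numbers in the usage section are made up rather than measured.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    # Expected tokens produced per target pass when each drafted token
    # is accepted independently with probability alpha:
    # 1 + alpha + alpha^2 + ... + alpha^gamma.
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    # Each step costs gamma draft calls plus one target pass, measured
    # in units of one target pass.
    return expected_tokens_per_pass(alpha, gamma) / (gamma * cost_ratio + 1.0)

def best_gamma(alpha: float, cost_ratio: float, max_gamma: int = 16) -> int:
    # Brute-force search for the draft length with the highest estimate.
    return max(range(1, max_gamma + 1),
               key=lambda g: expected_speedup(alpha, g, cost_ratio))

if __name__ == "__main__":
    # Illustrative values only: a fast, weaker drafter versus a slower,
    # stronger one, both drafting 4 tokens per step.
    fast = expected_speedup(alpha=0.7, gamma=4, cost_ratio=0.05)
    strong = expected_speedup(alpha=0.8, gamma=4, cost_ratio=0.30)
    print(f"fast drafter:   {fast:.2f}x")
    print(f"strong drafter: {strong:.2f}x")
    # A higher acceptance rate pushes the best draft length up.
    print("best gamma at alpha=0.6:", best_gamma(0.6, 0.05))
    print("best gamma at alpha=0.9:", best_gamma(0.9, 0.05))
```

Under these assumptions the faster drafter comes out ahead despite its lower acceptance rate, and the best draft length grows with the acceptance rate, which is one way to read the tuning advice above. Real acceptance rates vary by position and by workload, so measured tuning is still needed.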
