New Decoding Method Promises 2x Faster LLM Inference
A new technique called "Speculative Speculative Decoding" enables parallel drafting and verification in LLMs, reportedly achieving up to a 2x speedup over standard speculative decoding. The method offers a practical path toward more efficient and cost-effective inference for production systems.
The method, "Speculative Speculative Decoding" (SSD), was introduced by researchers Tanishq Kumar, Tri Dao, and Avner May, of Stanford University, Princeton University, and Together AI. Their work tackles a key bottleneck in standard speculative decoding: the sequential dependence between drafting new tokens and verifying them.
Standard speculative decoding uses a smaller, faster "draft" model to generate a sequence of candidate tokens, which the larger, more accurate "target" model then verifies in a single pass. This is faster than the one-token-at-a-time generation of traditional autoregressive decoding. However, the draft model must wait for verification to finish before it can start generating the next batch of speculative tokens.
SSD breaks this dependency by running drafting and verification in parallel. While the target model is verifying one set of drafted tokens, the draft model is already generating new speculative sequences for several possible outcomes of that verification. If the actual outcome matches one of these pre-computed "speculations," the next draft is ready immediately, which removes the drafting step from the critical path for that round.
The researchers developed an optimized SSD algorithm called Saguaro to address three main challenges: accurately predicting verification outcomes, balancing the quality of the speculative drafts against the likelihood of a correct prediction, and efficiently handling cases where the prediction is wrong. They report that Saguaro is up to 2x faster than already optimized speculative decoding and up to 5x faster than standard autoregressive decoding.
For startups building AI products, inference speed and cost are critical. Early-stage companies often rely on straightforward model APIs to develop quickly, but as they scale, optimizing inference becomes crucial for managing costs and improving user experience.
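To make the mechanism described above concrete, here is a minimal Python sketch of greedy speculative decoding and an SSD-style variant that overlaps verification with pre-drafting. It is an illustration under stated assumptions, not the authors' Saguaro implementation: the toy `target_next` and `draft_next` functions, the outcome-guessing heuristic in `prespeculate`, and every other name are hypothetical, and Python threads only stand in for work that a real system would run on separate hardware.

```python
from concurrent.futures import ThreadPoolExecutor

VOCAB = 5  # tiny toy vocabulary (assumption for illustration)

def target_next(ctx):
    """Toy 'target model': deterministic greedy next token."""
    return (sum(ctx) * 3 + len(ctx)) % VOCAB

def draft_next(ctx):
    """Toy 'draft model': matches the target except at every 4th position."""
    tok = target_next(ctx)
    if len(ctx) % 4 == 3:
        return (tok + 1 + (len(ctx) // 4) % 2) % VOCAB  # deliberately wrong
    return tok

def draft_tokens(ctx, k):
    """Draft k tokens autoregressively with the draft model."""
    out = []
    for _ in range(k):
        out.append(draft_next(ctx + out))
    return out

def verify(ctx, drafted):
    """Return (n_accepted, next_token); a real target model does this in one pass."""
    for i, tok in enumerate(drafted):
        expected = target_next(ctx + drafted[:i])
        if tok != expected:
            return i, expected  # first mismatch: target's token replaces it
    return len(drafted), target_next(ctx + drafted)  # all accepted, plus bonus token

def autoregressive(prompt, n_new):
    """Baseline: one target-model token per step."""
    ctx = list(prompt)
    while len(ctx) < len(prompt) + n_new:
        ctx.append(target_next(ctx))
    return ctx

def speculative(prompt, n_new, k=4):
    """Standard speculative decoding: draft, then verify, strictly in sequence."""
    ctx, limit = list(prompt), len(prompt) + n_new
    while len(ctx) < limit:
        drafted = draft_tokens(ctx, k)
        n, tok = verify(ctx, drafted)
        ctx += drafted[:n] + [tok]
    return ctx[:limit]

def prespeculate(ctx, drafted, k):
    """Guess verification outcomes and pre-draft the next round for each guess."""
    cache = {}
    for n in range(len(drafted) + 1):
        prefix = ctx + drafted[:n]
        base = draft_next(prefix)
        # Hypothetical heuristic: guess the draft's own token or its neighbour.
        for tok in (base, (base - 1) % VOCAB):
            if n < len(drafted) and tok == drafted[n]:
                continue  # that token would have been accepted, not rejected
            cache[(n, tok)] = draft_tokens(prefix + [tok], k)
    return cache

def ssd(prompt, n_new, k=4):
    """SSD-style loop: verification and pre-drafting are submitted together."""
    ctx, limit = list(prompt), len(prompt) + n_new
    drafted = draft_tokens(ctx, k)
    hits = misses = 0
    with ThreadPoolExecutor(max_workers=2) as pool:
        while len(ctx) < limit:
            fut_verify = pool.submit(verify, ctx, drafted)
            fut_cache = pool.submit(prespeculate, ctx, drafted, k)
            n, tok = fut_verify.result()
            cache = fut_cache.result()
            ctx += drafted[:n] + [tok]
            if (n, tok) in cache:   # outcome was predicted: next draft is ready
                drafted = cache[(n, tok)]
                hits += 1
            else:                   # misprediction: fall back to drafting now
                drafted = draft_tokens(ctx, k)
                misses += 1
    return ctx[:limit], hits, misses
```

Because every emitted token is the target model's greedy choice, all three functions produce the same sequence; what changes is how much drafting work remains on the critical path. For example, `ssd([1, 2, 3], 20)` returns the generated tokens together with counts of cache hits and misses.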
Techniques like speculative decoding, and now SSD, represent a trade-off: they can significantly reduce latency and increase throughput, but they also add complexity to the engineering stack. The decision to implement advanced techniques like SSD reflects an engineering culture that is deeply focused on performance and efficiency. For a startup engineer, navigating these trade-offs is a key skill: knowing when to stick with a simpler solution and when to invest in a more complex but faster one. Open-source projects and frameworks that incorporate these methods are lowering the barrier to entry for smaller teams.