Google AI Research Aims to Halve LLM Inference Costs
New research from Google AI proposes a "deep-thinking ratio" for large language models to improve reasoning accuracy while cutting total inference costs by half. The technique is particularly valuable for edge deployments, potentially enabling more powerful models to run on-device. This approach addresses key challenges in balancing model performance with computational efficiency.
- This technique redefines how model effort is measured, focusing on "deep-thinking tokens" whose internal predictions only stabilize within the final 15% of the model's layers.
- The research directly challenges the "longer is better" assumption in Chain-of-Thought reasoning, finding that longer token counts can actually have a negative correlation (average r = −0.59) with accuracy due to overthinking.
- The Deep-Thinking Ratio (DTR) metric demonstrated a strong positive correlation with accuracy (average r = 0.683) across multiple models, including DeepSeek-R1-70B and Qwen3-30B-Thinking.
- The cost savings are achieved through a strategy called "Think@n," which uses the DTR score to terminate unpromising generative paths early, thereby reducing wasted computation.
- This method adds to a suite of inference optimization techniques like quantization, which reduces model precision from 32-bit to 8-bit or 4-bit, and knowledge distillation, where a smaller "student" model learns from a larger "teacher" model.
- The research into efficient reasoning runs parallel to other advanced Google AI projects, such as the "Gemini Deep Think" system that achieved gold-medal level performance at the 2025 International Mathematical Olympiad.
- Unlike hardware-centric solutions, this approach is algorithmic, making it complementary to other cost-saving measures like continuous batching, where new requests are processed as soon as a single token is generated for an existing one.
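
To make the DTR idea concrete, here is a minimal Python sketch of how such a ratio and an early-pruning step might be computed. It is an illustration under stated assumptions, not the researchers' implementation: it assumes each token comes with one predicted token ID per layer, treats a prediction as "stabilized" at the earliest layer from which it matches the final-layer prediction without changing again, and keeps a fixed number of candidates by DTR. The function names (`settling_layer`, `deep_thinking_ratio`, `select_promising`) and the argmax-matching rule are hypothetical; the actual stabilization criterion and Think@n procedure may differ.

```python
from typing import List, Sequence


def settling_layer(layer_predictions: Sequence[int]) -> int:
    """Return the 0-based index of the earliest layer from which the
    per-layer prediction equals the final-layer prediction and stays equal.

    Assumption: "stabilized" means exact agreement of the predicted token ID.
    """
    final = layer_predictions[-1]
    idx = len(layer_predictions) - 1
    # Walk backwards while the preceding layer already agrees with the final one.
    while idx > 0 and layer_predictions[idx - 1] == final:
        idx -= 1
    return idx


def deep_thinking_ratio(
    per_token_layer_predictions: Sequence[Sequence[int]],
    late_fraction: float = 0.15,
) -> float:
    """Fraction of tokens whose prediction only settles in the last
    `late_fraction` of layers (15% in the article's description)."""
    if not per_token_layer_predictions:
        return 0.0
    num_layers = len(per_token_layer_predictions[0])
    late_layers = max(1, round(num_layers * late_fraction))
    cutoff = num_layers - late_layers  # first layer index of the late band
    deep = sum(
        1
        for preds in per_token_layer_predictions
        if settling_layer(preds) >= cutoff
    )
    return deep / len(per_token_layer_predictions)


def select_promising(prefix_dtrs: Sequence[float], keep: int) -> List[int]:
    """Hypothetical Think@n-style pruning: given a DTR measured on a short
    prefix of each candidate path, return the indices of the `keep` paths
    with the highest DTR; the others would be terminated early."""
    ranked = sorted(
        range(len(prefix_dtrs)), key=lambda i: prefix_dtrs[i], reverse=True
    )
    return sorted(ranked[:keep])


# Toy example with a 20-layer model, so the late band is layers 17-19.
tokens = [
    [5] * 20,              # settles at layer 0: shallow
    [1] * 18 + [7] * 2,    # settles at layer 18: deep-thinking
    [2] * 10 + [9] * 10,   # settles at layer 10: shallow
    [3] * 17 + [4] * 3,    # settles at layer 17: deep-thinking
]
dtr = deep_thinking_ratio(tokens)
survivors = select_promising([0.10, 0.40, 0.25, 0.05], keep=2)
```

In this toy run two of the four tokens settle inside the late band, and the pruning step keeps the two candidate paths with the highest prefix DTR. In practice the per-layer predictions would have to come from the model's intermediate hidden states, which this sketch does not cover.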