Cut tokens 10x with latent reasoning

- IBM Research AI posted “Thinking Without Words” on arXiv on April 24, showing Abstract-CoT can replace verbose reasoning text with short learned tokens. (arxiv.org)
- The headline number is up to 11.6× fewer reasoning tokens, tested on Qwen3-8B and Granite-3.3-8B while keeping performance comparable across tasks. (arxiv.org)
- If it holds up, the win is cheaper, lower-latency inference, but the tradeoff is less readable reasoning and early-stage evidence. (arxiv.org)

Reasoning models have a weird cost problem. The answer is often short, but the “thinking” can be long enough to dominate inference time, bandwidth, and memory use. (arxiv.org) On April 24, Keshav Ramji, Tahira Naseem, and Ramón Fernandez Astudillo posted “Thinking Without Words,” a method that tries to keep the reasoning while cutting the tokens. (arxiv.org)

### What did they actually change?

Instead of asking a model to write out a full verbal chain of thought, the method has it emit a short sequence of special abstract tokens from a reserved vocabulary, then produce the final answer. The paper calls this Abstract Chain-of-Thought, or Abstract-CoT. The core bet is simple: the model may not need English-like intermediate steps if a compact internal “reasoning language” can do the same job. (arxiv.org)

### Why are normal reasoning traces so expensive?

Normal reasoning traces are produced one token at a time. That means every intermediate step has to be generated, stored in the KV cache, and moved through the system before the model can finish. The paper frames that as an inference bottleneck, especially for tasks where the rationale is much longer than the final answer. (arxiv.org)

### So how big is the claimed win?

The main headline is up to 11.6× fewer reasoning tokens in the arXiv version, with tests on Qwen3-8B and Granite-3.3-8B. The authors say performance stays comparable across mathematical reasoning, instruction-following, and multi-hop reasoning tasks. That is the part people care about: not just shorter traces, but shorter traces without the usual accuracy collapse. (arxiv.org)

### How do the abstract tokens become meaningful?

They are not manually defined. Training first bottlenecks verbal chain-of-thought into masked forms, then teaches the model to generate abstract tokens from the prompt alone, and finally fine-tunes that behavior with reinforcement learning under constrained decoding. Basically, the model learns a compact codebook for reasoning rather than being handed one by humans.
(arxiv.org)

### Is this the same as pure latent reasoning?

Not quite. Some latent-reasoning methods operate on continuous hidden states that humans cannot inspect directly. IBM’s approach is different in that it still uses discrete generated tokens, just not normal words. That makes it a middle ground: more efficient than verbose text, but more structured than fully hidden-state-only reasoning. (arxiv.org)

### Why does that matter for deployment?

Because inference cost is often driven by token traffic. Fewer reasoning tokens mean fewer decoding steps, less memory pressure, and lower bandwidth between components. For edge deployments or high-volume serving, that matters a lot. A 10× cut in reasoning length is not a cosmetic optimization; it changes how expensive “reasoning mode” is to run. This is an inference-side win, though, not proof that total system cost drops once training overhead is counted. (arxiv.org)

### What is the tradeoff?

Readability. If the model reasons in a learned abstract code, humans lose the readable scratch work that made chain-of-thought attractive in the first place. And this is still early research: an arXiv paper, not a settled production standard. Other recent work on latent reasoning also points to efficiency gains, but the field is still sorting out how these systems actually think and how reliably they generalize. (arxiv.org)

### Bottom line

This paper matters because it targets one of the inefficiencies of reasoning models: making them “think out loud” in full sentences when they may not need to. If Abstract-CoT scales, reasoning models could get a lot cheaper to run. But the price of that efficiency is that the reasoning becomes less human-readable right when people want more visibility, not less. (arxiv.org)
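To put rough numbers on "cheaper to run," here is a back-of-the-envelope sketch of what a shorter reasoning trace does to decode steps and KV-cache memory. Everything in it is an assumption for illustration: the `ModelShape` values are hypothetical and are not the real configuration of Qwen3-8B or Granite-3.3-8B, the 2,000-token verbose trace is invented, and 11.6× is the paper's "up to" figure rather than a typical result. It also counts only the reasoning tokens, ignoring prompt, answer, and any training overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical decoder shape (an assumption, not a real model's config)."""
    num_layers: int = 32
    num_kv_heads: int = 8
    head_dim: int = 128
    bytes_per_value: int = 2  # 16-bit cache entries


def kv_cache_bytes_per_token(shape: ModelShape) -> int:
    # Every layer caches one key vector and one value vector per KV head.
    return 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * shape.bytes_per_value


def reasoning_footprint(reasoning_tokens: int, shape: ModelShape) -> dict:
    return {
        "tokens": reasoning_tokens,
        # Autoregressive decoding needs one forward pass per generated token.
        "decode_steps": reasoning_tokens,
        "kv_cache_mib": reasoning_tokens * kv_cache_bytes_per_token(shape) / 2**20,
    }


def compare(verbose_tokens: int, reduction: float, shape: ModelShape = ModelShape()) -> dict:
    # Shrink the trace by the claimed factor, keeping at least one token.
    abstract_tokens = max(1, round(verbose_tokens / reduction))
    return {
        "verbose": reasoning_footprint(verbose_tokens, shape),
        "abstract": reasoning_footprint(abstract_tokens, shape),
    }


if __name__ == "__main__":
    result = compare(verbose_tokens=2000, reduction=11.6)
    for name, stats in result.items():
        print(name, stats)
```

Under these assumed numbers, a 2,000-token trace occupies 250 MiB of KV cache per sequence, while the 172-token equivalent occupies 21.5 MiB, with decode steps falling by the same factor. Real savings depend on the actual model shape, batch size, and how often the best-case reduction is reached.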
