Finds 11.6× shorter latent reasoning
- IBM Research AI posted “Thinking Without Words” on April 24, introducing Abstract Chain-of-Thought, a way for language models to reason with reserved abstract tokens.
- In the paper’s headline result, Abstract-CoT used up to 11.6× fewer reasoning tokens while keeping performance comparable on math, instruction, and multi-hop tasks.
- It matters because chain-of-thought is expensive and often unfaithful, so shorter latent traces could cut cost while weakening interpretability.

Reasoning models usually “think out loud.” They print a long chain of words before they answer. That helps on hard tasks, but it is slow, expensive, and maybe a little misleading, because the visible explanation may not be the real computation happening inside the model. IBM Research AI’s new paper pushes on that gap with a simple idea: let the model think in a compact internal code instead of English, then answer normally. (arxiv.org)

### What actually changed?

The paper is called *Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought*. It was posted to arXiv on April 24 and revised on April 27. The authors (Keshav Ramji, Tahira Naseem, and Ramón Fernandez Astudillo at IBM Research AI) call the method Abstract-CoT. Instead of generating a natural-language chain-of-thought, the model first emits a short sequence of tokens from a reserved vocabulary, then produces the final answer. (arxiv.org)

### What is “latent reasoning” here?

Basically, it means the model is still doing multi-step reasoning, but not in plain language. The intermediate trace is a compact code. It is not hidden state all the way down, and not a normal English rationale either; it is more like a learned shorthand the model can use internally. In this paper, that shorthand is discrete, not continuous, which matters because it can still be generated token by token and optimized with familiar post-training methods. (arxiv.org)

### Why bother replacing words?

Because words are a clunky format for internal thought. Long chain-of-thought traces eat inference tokens, raise latency, and increase cost. The paper also leans on a broader argument now circulating in the field: visible chain-of-thought may be useful, but it is not necessarily a faithful window into how the model really solved the problem. If the real work is happening in latent states, forcing everything through English may be wasteful. (arxiv.org)

### How do they teach a model this code?

The trick is a warm-up loop.
First, the model starts from verbal chain-of-thought and learns a bottlenecked version by masking and supervised fine-tuning. Then it self-distills, training itself to produce the abstract tokens directly from the prompt using constrained decoding over the codebook. After that, the authors use warm-started reinforcement learning, again with constrained decoding, to improve the […] a secret language from scratch in one jump; it gets there through compression and refinement. (arxiv.org)

### What was the result?

The headline number is up to 11.6× fewer reasoning tokens. And the important part is that this did not come with a big collapse in quality in the paper’s reported tests. The authors say performance stayed comparable across mathematical reasoning, instruction-following, and multi-hop reasoning, and that the approach generalized across language model families. That is the news hook: a much shorter reasoning trace without obvi[…]about. (arxiv.org)

### Does that mean the model is “thinking without words”?

Sort of, but don’t overread the phrase. The model still outputs tokens. They are just tokens from a reserved abstract vocabulary instead of normal sentences. So this is not proof that language was never needed, and it is not a full map of hidden-state reasoning. It is better understood as a practical middle ground between verbose English rationales and fully continuous latent reasoning. (arxiv.org)

### Why does this hit the interpretability debate?

Because it cuts both ways. On one hand, shorter latent traces could make reasoning cheaper and maybe more scalable. On the other hand, if the best-performing intermediate representation is less human-readable, then one of the field’s favorite inspection tools, reading the chain-of-thought, looks weaker as a window into the real process. A recent position paper makes that argument directly: the pri[…]tories, not the visible explanation. (arxiv.org)

### Bottom line?
This paper does not settle what reasoning “really is” in language models. But it does show something concrete: you can compress the visible reasoning trace hard — by as much as 11.6× in this setup — and still keep performance in the same ballpark. That is a useful result for anyone building reasoning models, and a mildly uncomfortable one for anyone treating chain-of-thought as the thought itself. (arxiv.org)
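
For readers who want a concrete picture of the "constrained decoding over the codebook" step mentioned in the training recipe, here is a minimal sketch. It is not the paper's implementation. It shows one common way to restrict generation to a reserved set of token ids: set every other logit to negative infinity before picking the next token. The vocabulary size, the codebook ids, and the `toy_logits` stand-in for a model forward pass are all hypothetical.

```python
import math
import random


def mask_to_codebook(logits, codebook_ids):
    """Return a copy of logits with every non-codebook entry set to -inf."""
    allowed = set(codebook_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def greedy_constrained_decode(step_logits_fn, codebook_ids, max_len):
    """Emit max_len abstract tokens, each picked greedily from the codebook."""
    trace = []
    for _ in range(max_len):
        logits = step_logits_fn(trace)                  # scores for the next token
        masked = mask_to_codebook(logits, codebook_ids)
        next_id = max(range(len(masked)), key=masked.__getitem__)
        trace.append(next_id)
    return trace


# Hypothetical toy setup: a 10-token vocabulary whose last four ids are
# reserved as the abstract codebook.
VOCAB_SIZE = 10
CODEBOOK_IDS = [6, 7, 8, 9]


def toy_logits(prefix):
    """Stand-in for a model forward pass (assumption, not a real model)."""
    rng = random.Random(len(prefix))
    return [rng.uniform(-1.0, 1.0) for _ in range(VOCAB_SIZE)]


abstract_trace = greedy_constrained_decode(toy_logits, CODEBOOK_IDS, max_len=5)
print(abstract_trace)
```

Whatever the toy scores look like, every id in `abstract_trace` comes from `CODEBOOK_IDS`, which is the point of the constraint: ordinary word tokens can never appear in the reasoning trace. A real system would apply the same mask to a model's logits and would typically sample rather than decode greedily.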