Chain‑of‑Thought Boost
New deep-learning work called FIPO reportedly pushes average chain-of-thought length past 10,000 tokens on Qwen2.5-32B and reaches about 56% Pass@1 on AIME 2024, a notable step for long-reasoning LLM tasks. At the same time, people are re-checking the field's foundations by revisiting ResNet, AlexNet, Transformers and Adam to understand how older building blocks still shape today's models.
A language model does not solve a hard math problem in one jump. It writes one token at a time, and each token is a tiny bet on what should come next, which is why long reasoning can fall apart halfway through. (arxiv.org)

Researchers call the running explanation a model writes on its way to an answer a chain of thought, and in practice it is just the model's scratch work stretched across hundreds or thousands of tokens. When the scratch work is too short, the model quits early; when it is too noisy, the final answer drifts. (arxiv.org)

Training usually rewards the whole answer at the end, like grading a 20-step algebra solution with one red checkmark and no notes in the margin. That makes it hard for the model to learn which early token helped and which one sent the rest of the solution off course. (arxiv.org)

That problem is called credit assignment, and it has been one of the central bottlenecks in reinforcement learning for language models. If every token gets nearly the same reward, the model has little reason to learn deeper multi-step plans. (arxiv.org)

The new work, called Future-KL Influenced Policy Optimization (FIPO), tries to fix that by scoring tokens according to how the rest of the reasoning path changes after them. In plain terms, it asks which earlier move made the later moves better, instead of paying every move the same wage. (arxiv.org)

The authors tested that idea on Qwen2.5-32B-Base, a 32-billion-parameter model from Alibaba's Qwen family. Qwen's official release says the Qwen2.5 line includes 14-billion and 32-billion-parameter models trained on up to 18 trillion tokens. (qwen.ai)

According to the FIPO paper, the average chain-of-thought length on Qwen2.5-32B grew from about 4,000 tokens to more than 10,000 tokens. On the American Invitational Mathematics Examination (AIME) 2024 benchmark, Pass@1 rose from 50.0 percent to a peak of 58.0 percent and settled near 56.0 percent.
(arxiv.org)

The model card published with the release reports the same headline result and says the method beat reproduced pure reinforcement-learning baselines such as DAPO and DeepSeek-R1-Zero-32B on that setup. The GitHub repository for the project repeats those numbers and frames the gain as longer reasoning rather than simple verbosity. (huggingface.co, github.com)

That distinction matters because long answers are cheap, and useful long answers are not. The paper says the added tokens increasingly show self-reflection and multi-pass verification, which means the model often checks its own work instead of merely talking longer. (arxiv.org)

At the same time, the conversation around this result has turned backward as much as forward. People are revisiting four older building blocks (AlexNet, ResNet, the Transformer, and Adam) to see how much of today's progress still rests on ideas laid down between 2012 and 2017. (proceedings.neurips.cc, arxiv.org, arxiv.org, arxiv.org)

AlexNet was the 2012 shock to the system. Krizhevsky, Sutskever, and Hinton trained a deep convolutional network with 60 million parameters and won the 2012 ImageNet competition with a top-5 error rate of 15.3 percent, far ahead of the 26.2 percent runner-up. (proceedings.neurips.cc)

ResNet was the 2015 to 2016 fix for depth itself. Kaiming He and colleagues showed that skip connections let networks reach 152 layers while remaining easier to optimize, turning "just make it deeper" from a failure mode into a workable recipe. (arxiv.org, cv-foundation.org)

The Transformer was the 2017 redesign that removed recurrence and built sequence modeling around attention alone. Vaswani and coauthors reported better translation quality with more parallel training, and that architecture became the chassis for modern large language models. (arxiv.org, research.google)

Adam was the optimizer that made huge models easier to train in the first place.
Kingma and Ba described it in 2014 as a first-order method using adaptive estimates of the first and second moments of the gradient, and its mix of speed, stability, and low memory cost made it standard across deep learning. (arxiv.org)

FIPO fits that older pattern more than it breaks from it. It does not replace the Transformer or invent a new base model; it changes how reinforcement learning assigns blame and credit inside a familiar training stack, which is why the result feels new and strangely old at the same time. (arxiv.org, arxiv.org, arxiv.org)

The safest reading, for now, is that long reasoning is becoming less of a scaling accident and more of an optimization problem engineers can target directly. If that holds up beyond one benchmark and one model family, the next jump in language models may come from better training signals, not just bigger models and more data. (arxiv.org, huggingface.co)
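
The credit-assignment contrast running through this piece can be made concrete with a toy sketch. The Python below is not the FIPO objective, which is not reproduced here. It only assumes a hypothetical per-token `influence` score and shows how weighting by such a score differs from paying every token the same advantage. The function names, the baseline, and the influence numbers are all invented for illustration.

```python
from typing import List


def uniform_token_advantages(
    num_tokens: int, outcome_reward: float, baseline: float
) -> List[float]:
    """Outcome-only credit: every token receives the same advantage."""
    advantage = outcome_reward - baseline
    return [advantage] * num_tokens


def reweighted_token_advantages(
    num_tokens: int,
    outcome_reward: float,
    baseline: float,
    influence: List[float],
) -> List[float]:
    """Toy token-level credit (an assumption, not the paper's formula).

    Each token's share of the sequence-level advantage is scaled by a
    hypothetical influence score, normalized so the average weight is 1
    and the total credit matches the uniform case.
    """
    if len(influence) != num_tokens:
        raise ValueError("influence must have one score per token")
    mean_influence = sum(influence) / num_tokens
    if mean_influence <= 0:
        raise ValueError("influence scores must have a positive mean")
    advantage = outcome_reward - baseline
    return [advantage * w / mean_influence for w in influence]


if __name__ == "__main__":
    # Four-token "solution" that earned reward 1.0 against a baseline of 0.5.
    print(uniform_token_advantages(4, 1.0, 0.5))
    # Made-up influence scores: the second token mattered most.
    print(reweighted_token_advantages(4, 1.0, 0.5, [0.5, 2.0, 1.0, 0.5]))
```

With these made-up numbers, the second token ends up with four times the credit of the first while the total credit across the sequence stays the same, which is the "different wages" idea in miniature.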