USTC shows RL post‑training scales
- USTC, Shanghai AI Lab, and Oxford researchers showed that RL post-training on Qwen2.5 math models scales predictably, and bigger models get more from it.
- Across 54 experiments from 0.5B to 72B parameters, test loss followed a power law, while data-limited performance depended more on the number of optimization steps than on having unique samples.
- That matters because RL tuning now looks less like artisanal tweaking and more like a scalable recipe for reasoning gains.

Reinforcement learning after pretraining has had a weird status in AI. Everyone could see it mattered for reasoning models, but the field still lacked a clean answer to a basic scaling question: if you spend more RL compute on a bigger model, do the returns behave in a predictable way?

This paper says yes. A team spanning USTC, Shanghai AI Lab, Oxford, and several other institutions ran a broad set of math-reasoning experiments on the Qwen2.5 family and found that RL post-training follows a usable scaling law, not just trial-and-error luck.

### What did they actually test?

They used dense Qwen2.5 models from 0.5B up to 72B parameters and ran 54 controlled RL post-training experiments focused on mathematical reasoning. The point was not to crown one model. It was to map how model size, training data, and compute budget interact once you are already past pretraining and into RL fine-tuning.

Pretraining has had scaling laws for years. That gave labs a rough map for how many parameters, tokens, and FLOPs to buy. RL post-training, by contrast, has been more like expensive kitchen intuition: tweak rewards, run jobs, hope the model gets better. If RL also scales smoothly, labs can plan it instead of guessing.

The result is that larger models were more efficient on both compute and data during RL post-training. Under a fixed compute budget, bigger models trained for fewer steps beat smaller models trained for more steps. Under a fixed data budget, bigger models also reached lower loss. That is the headline: scale helps twice.

Performance also did not improve randomly. Test loss, compute, and data fit a predictive power-law relationship across both base and instruction-tuned models. That matters because power laws are the closest thing modern AI has to a planning equation. If the curve holds, you can estimate what extra RL compute or a larger base model is likely to buy you before spending the money.

### Did they find any catch?
Yes, and it is an important one. The paper says the learning-efficiency term, written as k(N), shows a latent saturation trend as model size keeps increasing. So bigger still helps, but the paper is not claiming that the gains keep accelerating forever. This is more “scales with some eventual flattening” than “just keep making it huge.”

One of the more interesting findings is that in data-constrained settings, reusing high-quality data worked surprisingly well. Final performance depended more on the total number of optimization steps than on having fully unique samples every time. In plain English: if your math RL data is good, looping over it more can still pay off.

### Does this change the reasoning debate?

Not really. This is about mathematical reasoning under a specific RL post-training setup, not a universal law for every domain or every reward design. But it does sharpen the picture. A lot of recent reasoning progress looked like it might have come from secret sauce. It turns out a meaningful chunk may come from something more boring and more powerful: scaling discipline.

### Why should anyone outside labs care?

Because this changes where the frontier may move next. If RL post-training rewards larger models disproportionately, then the best reasoning systems may improve not just through better pretraining, but through more aggressive post-training on top of already-large bases. That shifts budgets, infrastructure priorities, and maybe even who can compete.

The bottom line is simple. This paper makes RL post-training look more like an engineering scaling problem and less like alchemy. For labs chasing better reasoning, that is a very useful change.
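
For readers wondering what using a power law as a "planning equation" looks like in practice, here is a minimal sketch. It assumes the simplest possible form, loss = A * C^(-k) for a single model size, which is a straight line in log-log space. That form, the function names `fit_power_law` and `predict_loss`, and every number below are illustrative assumptions; they are not the paper's fitted equation or its data.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = intercept - k * log(compute).

    Returns (k, intercept), where k is the power-law exponent.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, intercept


def predict_loss(k, intercept, compute):
    """Extrapolate the fitted line to a new compute budget."""
    return math.exp(intercept - k * math.log(compute))


# Synthetic points generated from loss = 2.0 * C^(-0.1); NOT real results.
compute = [1e3, 1e4, 1e5, 1e6]
loss = [2.0 * c ** -0.1 for c in compute]

k, intercept = fit_power_law(compute, loss)

# Estimate the loss at a 10x larger budget before "spending the money".
predicted = predict_loss(k, intercept, 1e7)
print(f"fitted exponent k = {k:.3f}, predicted loss at 1e7 = {predicted:.3f}")
```

In a real setting you would fit one such curve per model size on measured test loss, then compare the exponents across sizes. The saturation the paper describes in k(N) would show up as those exponents growing more slowly as the models get larger.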