RL is getting reasoning data

Researchers and practitioners are training models with reinforcement learning on synthetic math and coding problems to boost reasoning rather than relying on raw scale alone, and that shift is showing up in today’s conversation about next-gen models. @ClayCampaigne highlighted recent RL advances in reasoning models, citing work like OpenAI’s o1-preview that uses synthetic math and coding datasets, as a driver of improved chain-of-thought behavior. (x.com)

The old way to make a language model smarter was mostly to pour in more text and more computing power, like making a bigger library and a bigger engine at the same time. The new twist is to give the model problems with checkable right answers, then reward the steps that lead to the right result. (openai.com) That reward process is called reinforcement learning, and it works more like training a dog with a treat than filling a notebook with facts. Instead of only copying patterns from internet text, the model gets signal from whether its answer on a math proof, coding task, or logic puzzle actually works. (openai.com)

This works especially well on synthetic data, which means problems generated by software rather than written one by one by humans. Math equations, programming contests, and formal logic tasks are useful here because a computer can usually tell whether the final answer is correct. (arxiv.org)

OpenAI said this directly when it introduced o1-preview on September 12, 2024. The company wrote that its large-scale reinforcement learning algorithm teaches the model to “think productively” and that performance keeps improving with more reinforcement learning during training and more time spent thinking during use. (openai.com) That is a different scaling law from the one that built earlier chatbots. OpenAI said the limits on this approach “differ substantially” from ordinary pretraining, which is the giant text-ingestion phase that made earlier models fluent in the first place. (openai.com)

You can see the effect in the kinds of tests these models now target. OpenAI’s o1 launch said the model was built to solve harder problems in science, coding, and math, and a later OpenAI research paper said reinforcement learning significantly boosts performance on complex coding and reasoning tasks. (openai.com) (arxiv.org) The coding side matters because code is easy to grade compared with essays. A program either passes the unit tests or it does not, which gives reinforcement learning a clean reward signal instead of a vague thumbs-up. (arxiv.org)

DeepSeek pushed the same idea further in January 2025 with DeepSeek-R1. Its paper said reasoning abilities can emerge through pure reinforcement learning without human-written reasoning traces, and the model was trained on verifiable tasks in mathematics, coding competitions, and science. (arxiv.org) DeepSeek also showed the limit of pure reward training. Its GitHub release says the early DeepSeek-R1-Zero model developed endless repetition, poor readability, and language mixing, so the company added “cold-start” data before reinforcement learning to make the reasoning more usable. (github.com)

That is why today’s reasoning race is not just about bigger models. It is about finding more tasks where the answer can be checked automatically, so the model can practice thousands or millions of times the way a chess engine improves by playing games against itself. (arxiv.org) (openai.com)

Anthropic’s later products point in the same direction from the user side. Claude 3.7 Sonnet and Claude 4 both added modes that let the model spend longer in “extended thinking,” which Anthropic says improves math, science, and coding, even when the company describes the feature as product behavior rather than a training recipe. (anthropic.com 1) (anthropic.com 2)

So when people say reinforcement learning is getting reasoning data, they mean the industry has found a new fuel source: problems with answers that can be verified by a machine. If pretraining taught models to sound like they know things, this newer phase is teaching them to work through problems where being right is measurable. (openai.com) (arxiv.org)
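
To make the pass-or-fail grading of code described above concrete, here is a minimal Python sketch of a unit-test reward. It is an illustration only, not the method of any lab named in this piece: the function name `unit_test_reward`, the test-case format, and the two toy candidate programs are all assumptions made up for the example.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical test-case format: (input arguments, expected output).
TestCase = Tuple[tuple, object]


def unit_test_reward(candidate: Callable, tests: Iterable[TestCase]) -> float:
    """Return 1.0 if the candidate passes every test, otherwise 0.0."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return 0.0  # wrong answer: no reward
        except Exception:
            return 0.0  # a crash counts as a failure too
    return 1.0


# Toy task (an assumption for this sketch): add two integers.
TESTS = [((2, 3), 5), ((-1, 1), 0), ((0, 0), 0)]


def correct_add(a, b):
    return a + b


def buggy_add(a, b):
    return a - b  # deliberate bug


rewards = [unit_test_reward(f, TESTS) for f in (correct_add, buggy_add)]
print(rewards)  # [1.0, 0.0]
```

In a real training loop the candidates would be programs written by the model, and this score would be what the reinforcement learning step rewards. The sketch only shows why the signal is clean: it is a 1 or a 0, with no human judgment involved.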
