RL preserves base-model features
- MIT researchers posted and presented “RL’s Razor,” an ICLR 2026 paper arguing on-policy reinforcement learning forgets less than supervised fine-tuning on new tasks.
- The key claim is mechanistic: forgetting tracks KL divergence from the base policy, and RL tends to find lower-KL solutions than SFT.
- That matters because post-training may not just add skills; it can also preserve or overwrite the base model’s reusable capabilities.

Post-training is the part of LLM building where teams try to make a general model actually useful. You take a base model that knows a lot, then push it toward a task with supervised fine-tuning (SFT) or reinforcement learning (RL). The problem is that these methods do not just add behavior. They can also erase behavior. The new wrinkle is a cluster of recent papers arguing that RL often preserves more of the base model than SFT does, and that this is not a vibe but a measurable property of how far the model moves. (openreview.net)

### What changed this week?

The specific thing people were reacting to is “RL’s Razor,” a paper by Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal that was accepted as an ICLR 2026 poster. The paper says online RL can match SFT on a new task while preserving prior capabilities much better. It also proposes a simple rule for predicting forgetting: look at the KL divergence between the fine-tuned policy and the base policy. Less shift, less forgetting. (openreview.net)

### What does “preserve the base model” actually mean?

It means the model can still do the other things it could do before fine-tuning. A base LLM has a big bank of representations, heuristics, and latent skills from pretraining. If post-training pushes the model too hard toward one narrow answer style, those older capabilities can get overwritten or become harder to access. That is catastrophic forgetting in plain English. (openreview.net)

### Why would RL forget less than SFT?

SFT tells the model “imitate these target outputs.” There can be many parameter settings that fit those outputs, including ones that drift far from the original model. On-policy RL works differently. It starts from the model’s own current behavior, samples from it, and improves reward from there. That setup creates a bias toward solutions that stay close to the base policy. That is the core idea behind RL’s Razor. (openreview.net)

### Is there evidence beyond one paper?

Yes, but it is mixed in an interesting way.
A 2025 paper called “SFT Memorizes, RL Generalizes” found RL generalized better than SFT on unseen rule variants and visual variants, while SFT tended to memorize training patterns. Another 2025 paper found RL fine-tuning often updates only a sparse subnetwork, roughly 5% to 30% of parameters, leaving the rest largely unchanged (openreview.net), which suggests a mechanism for why more pretrained capability survives. (arxiv.org)

### So does RL always beat SFT?

No, and this is the part people skip. Another 2025 paper argues RL is not magic. It can recover a lot of the out-of-distribution performance that SFT damages, but not all of it. If the SFT stage overfits badly and causes a large distribution shift, RL may not fully repair the damage afterward. In other words, RL looks better when it starts from a decent checkpoint, not a wrecked one. (arxiv.org)

### What is the practical takeaway for model builders?

Use SFT and RL as different tools, not interchangeable ones. SFT is still useful for format learning, bootstrapping, and getting a model into the right response regime. But if you care about keeping the base model’s broad competence intact, the newer work suggests you should watch how far your post-training moves the policy, and consider RL when the task can be expressed with a reward. (arxiv.org)

### Why does this matter beyond benchmark scores?

Because the industry increasingly wants one model to do many things over a long lifetime. If every new fine-tune overwrites old skills, you get brittle systems and expensive retraining loops. If RL really tends to make smaller, more targeted changes, then the win is not just higher scores. The win is a model you can keep adapting without constantly hollowing it out. (openreview.net)

### Bottom line?

The emerging picture is not “RL good, SFT bad.” It is narrower and more useful: RL often seems to solve new tasks with less drift from the base model, and that may be why more old capabilities survive.
For teams doing post-training, that shifts the question from “which method is stronger?” to “which method gets the behavior we want without rewriting the model more than necessary?” (openreview.net)
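
To make “how far the policy moves” concrete, here is a minimal sketch of measuring drift as the average per-token KL divergence between a tuned model and its base. It is an illustration, not the paper’s evaluation code. The function names, the toy logits, and the direction of the KL (tuned relative to base) are all assumptions made for the example. In practice the logits would come from running both models on the same prompts from the new task.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(p_logits, q_logits):
    """KL(p || q) for two next-token distributions given as logits."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kl(p_logits_seq, q_logits_seq):
    """Average KL(p || q) over aligned token positions."""
    if len(p_logits_seq) != len(q_logits_seq):
        raise ValueError("sequences must cover the same token positions")
    per_position = [kl_divergence(p, q) for p, q in zip(p_logits_seq, q_logits_seq)]
    return sum(per_position) / len(per_position)

# Toy logits (hypothetical): 2 token positions, vocabulary of 3 tokens.
base = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
tuned_small_shift = [[2.1, 1.0, 0.1], [0.5, 2.6, 0.3]]  # stays near the base
tuned_large_shift = [[0.1, 1.0, 3.0], [2.5, 0.5, 0.3]]  # prefers different tokens

# Direction is an assumption for this sketch: tuned policy relative to base.
drift_small = mean_kl(tuned_small_shift, base)
drift_large = mean_kl(tuned_large_shift, base)

print(f"small-shift drift: {drift_small:.4f}")
print(f"large-shift drift: {drift_large:.4f}")
```

If the KL-tracks-forgetting claim holds, the checkpoint with the larger drift would be the one more likely to have lost prior capabilities, so a number like this could be logged alongside task accuracy when comparing post-training runs.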