Reinforcement learning preserves base features

- A mechanistic interpretability analysis found RL fine-tuning changes LLM features far less than supervised fine-tuning, keeping base-model representations largely intact after training.
- Interventions in the study showed those muted changes are causal: RL-tuned models generalize using preserved features while SFT produces specialized features that induce forgetting in downstream tasks.
- The paper suggests reward-based tuning can retain foundational knowledge during updates, easing continual-learning concerns. (x.com)
