Study: RL Fails to Boost Small LLM Math Skills

A new technical analysis found that reinforcement learning (RL) fails to meaningfully improve the mathematical reasoning of small (0.6B-parameter) LLMs. In the experiment, basic supervised fine-tuning was more effective for these models, suggesting RL isn't a magic bullet for adding quantitative skills to smaller, cost-effective agents.

The study's focus on a 0.6B-parameter model highlights a critical tradeoff in AI development: cost versus capability. Smaller models are faster and cheaper to run, making them well suited to on-device or real-time financial applications, but they often lack the reasoning depth of their larger counterparts. This research reinforces the idea that techniques designed for massive models, like reinforcement learning, may not scale down effectively.

The specific method mentioned, Group Relative Policy Optimization (GRPO), is a variant of the more common Proximal Policy Optimization (PPO). GRPO was designed to be more efficient by removing the "critic" model that PPO uses to estimate how good an outcome is. Instead, it samples a group of attempts at the same problem and scores each attempt relative to the rest of the group. While promising for large models, this study suggests its benefits don't translate to smaller models for mathematical tasks.

For smaller models, the limited number of parameters can be a bottleneck for complex reasoning. These models have less capacity to store the intricate patterns needed for multi-step mathematical problems. Supervised fine-tuning (SFT), which directly teaches the model with correct input-output examples, proves more effective because it provides explicit, high-quality data that the smaller model can more easily absorb.

This finding is crucial for developers in quantitative finance building custom AI agents. Relying on smaller, fine-tuned models for specific, narrow tasks (such as data extraction or sentiment analysis) can be more cost-effective than using a large, general-purpose model. Techniques like knowledge distillation, where a large model "teaches" a smaller one, also offer a path to give small models advanced capabilities without the high cost of RL.

The challenge isn't just about training; it's about inference speed and cost in live trading or analysis systems. A 0.6B model has significantly lower latency than a 70B model, which is critical for real-time applications. The study implies that for quant developers, optimizing a portfolio of smaller, specialized, and efficiently tuned models may yield better ROI than trying to force advanced reasoning into a single, small agent through RL.

Ultimately, this points to a more nuanced strategy for building agentic systems in finance. Instead of a single, powerful AI, the future may involve a collaboration of models: a large model for high-level planning and several smaller, fine-tuned models to execute specific, computationally intensive tasks efficiently. This hybrid approach balances the reasoning power of large models with the speed and cost-effectiveness of smaller ones.
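
To make the GRPO description above more concrete, here is a minimal sketch of the group-relative scoring step that stands in for PPO's critic. It is an illustration under stated assumptions, not the study's code: the function name, the 0/1 correctness rewards, the use of the population standard deviation, and the small eps constant are all choices made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to its own group.

    Hypothetical helper: subtracts the group mean reward and divides by the
    group standard deviation, so no separate critic model is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed setup: four sampled answers to one math prompt,
# reward 1.0 if the final answer is correct, else 0.0.
mixed_group = [1.0, 0.0, 0.0, 1.0]
mixed_advantages = group_relative_advantages(mixed_group)

# A group where every attempt fails carries no relative signal.
all_wrong_group = [0.0, 0.0, 0.0, 0.0]
all_wrong_advantages = group_relative_advantages(all_wrong_group)

print(mixed_advantages)
print(all_wrong_advantages)
```

In the mixed group, correct answers get positive advantages and incorrect ones negative. When every attempt in a group earns the same reward, all advantages are zero. That is a property of this formula, not a finding attributed to the study.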

Shared from Scout - Be the smartest in the room.