GRPO cuts RL token cost
- A high-throughput RL recipe that combines end-to-end FP8 precision, GRPO, and two-phase training claims to slash continual-learning costs for reasoning models.
- Napkin math in the thread pegs continual learning at about $65 per million tokens, roughly 10× cheaper than typical RL approaches, but only if you sustain massive token throughput.
- The poster warns that the method is practical mainly for large labs with huge token streams and dedicated infrastructure. (x.com)
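
The napkin-math claim above can be sketched in a few lines of Python. Only the $65 per million tokens figure and the roughly 10× ratio come from the thread. The GPU price, GPU count, and throughput values below are hypothetical placeholders, chosen to show why the cost depends so heavily on sustained throughput. They are not numbers from the post.

```python
def cost_per_million_tokens(gpu_hourly_usd, num_gpus, tokens_per_second):
    """USD per one million tokens at a sustained cluster throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    cluster_usd_per_second = gpu_hourly_usd * num_gpus / 3600.0
    return cluster_usd_per_second / tokens_per_second * 1_000_000


def implied_baseline(cost_usd, ratio):
    """Baseline cost implied by a claimed cost and a 'ratio x cheaper' claim."""
    return cost_usd * ratio


# Figures quoted in the thread.
claimed_cost = 65.0    # USD per million tokens
claimed_ratio = 10.0   # "roughly 10x cheaper"
baseline_cost = implied_baseline(claimed_cost, claimed_ratio)

# Hypothetical cluster (assumption, not from the post): 8 GPUs at $2/hour each.
full_rate = cost_per_million_tokens(2.0, 8, tokens_per_second=1000)
half_rate = cost_per_million_tokens(2.0, 8, tokens_per_second=500)

print(f"implied baseline: ${baseline_cost:.0f} per million tokens")
print(f"hypothetical cluster at 1000 tok/s: ${full_rate:.2f} per million tokens")
print(f"hypothetical cluster at  500 tok/s: ${half_rate:.2f} per million tokens")
```

With the cluster cost fixed, the cost per token scales inversely with throughput: halving the sustained token rate doubles the price per million tokens. That is consistent with the poster's caveat that the savings hold only at very high token volumes.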