MiniMax Releases Framework for Scalable Reinforcement Learning
What happened
AI research firm MiniMax has revealed Forge, a new framework for scalable reinforcement learning (RL) in real-world agent systems. The framework is designed to improve the throughput, stability, and flexibility of training industrial-grade agents. Such advancements matter for deploying agents that learn and adapt in dynamic environments.
Why it matters
- The Forge framework achieves a reported 40x training speedup by using optimized asynchronous scheduling strategies and a tree-structured merging strategy for training samples.
- Its architecture is designed to be "agent-native," introducing an intermediary layer that completely decouples the training and inference engine from the agent's internal implementation, allowing it to work with arbitrary and even black-box agents.
- For algorithmic stability with Mixture-of-Experts (MoE) models, the framework continues to use CISPO (Clipped IS-weight Policy Optimization), the algorithm MiniMax introduced in earlier research.
- To handle credit assignment challenges in long-context tasks, Forge employs a composite reward framework that includes "Process Rewards" for dense feedback on intermediate steps, rather than relying only on the final outcome.
- The framework was battle-tested in the development of MiniMax's M2.5 model, which was trained across hundreds of thousands of distinct real-world environments and agent scaffolds.
- Internally, the M2.5 model trained with Forge now automates 30% of real business tasks at MiniMax, and its generated code accounts for 80% of the company's newly committed code.
- MiniMax was founded in 2021 by Yan Junjie, former VP at AI giant SenseTime, and is backed by major tech and venture capital firms including Alibaba, Tencent, and Hillhouse Capital.
- The company is a major player in China's AI sector, having raised approximately $850 million in private funding before a successful Hong Kong IPO that raised an additional $619 million.
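The composite reward idea in the list above can be illustrated with a short sketch. It is not MiniMax's implementation: the function name `composite_returns`, the mixing weight `process_weight`, and the discount `gamma` are assumptions made for this example. It adds a dense per-step process reward to a single outcome reward granted at the final step, then computes the return each step would be credited with.

```python
from typing import List


def composite_returns(
    process_rewards: List[float],
    outcome_reward: float,
    process_weight: float = 0.5,  # hypothetical mixing weight
    gamma: float = 1.0,           # hypothetical discount factor
) -> List[float]:
    """Return the discounted return-to-go for every step of one trajectory.

    Each step receives a weighted process reward; the outcome reward is
    added to the final step only.
    """
    if not process_rewards:
        return []
    rewards = [process_weight * r for r in process_rewards]
    rewards[-1] += outcome_reward

    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step accumulates everything that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Three intermediate steps, the second judged unhelpful, task solved.
    print(composite_returns([1.0, -1.0, 1.0], outcome_reward=2.0))
```

With an outcome-only reward, all three steps would receive the same credit; the process term is what lets the unhelpful middle step end up with a lower return than its neighbours.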
Key numbers
- The Forge framework achieves a reported 40x training speedup by using optimized asynchronous scheduling strategies and a tree-structured merging strategy for training samples.
- The framework was battle-tested in the development of MiniMax's M2.5 model, which was trained across hundreds of thousands of distinct real-world environments and agent scaffolds.
- Internally, the M2.5 model trained with Forge now automates 30% of real business tasks at MiniMax, and its generated code accounts for 80% of the company's newly committed code.
- MiniMax was founded in 2021 by Yan Junjie, former VP at AI giant SenseTime, and is backed by major tech and venture capital firms including Alibaba, Tencent, and Hillhouse Capital.
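The article does not spell out how the tree-structured merging of training samples works. One plausible reading, offered here only as an assumption, is prefix sharing: samples from the same agent run that begin with the same tokens are stored in one tree, so the shared prefix is processed once. The names `PrefixNode`, `merge_samples`, and `count_nodes` are hypothetical, and the sketch makes no claim about the reported speedup.

```python
from typing import Dict, List, Sequence


class PrefixNode:
    """One token in the merged tree; children are keyed by the next token."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixNode"] = {}


def merge_samples(samples: Sequence[Sequence[int]]) -> PrefixNode:
    """Insert every token sequence into a single prefix tree."""
    root = PrefixNode()
    for tokens in samples:
        node = root
        for tok in tokens:
            if tok not in node.children:
                node.children[tok] = PrefixNode()
            node = node.children[tok]
    return root


def count_nodes(root: PrefixNode) -> int:
    """Count token nodes in the tree (the root itself holds no token)."""
    total = 0
    stack: List[PrefixNode] = [root]
    while stack:
        node = stack.pop()
        total += len(node.children)
        stack.extend(node.children.values())
    return total


if __name__ == "__main__":
    # Three samples that share the prefix [1, 2]; two also share [1, 2, 3].
    samples = [[1, 2, 3, 4], [1, 2, 3, 5, 6], [1, 2, 7]]
    flat_tokens = sum(len(s) for s in samples)
    merged_tokens = count_nodes(merge_samples(samples))
    print(flat_tokens, merged_tokens)
```

In this toy case the three samples hold 12 tokens when stored separately but only 7 distinct positions once merged, which is the kind of redundancy a merging strategy would aim to remove.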
What happens next
- For algorithmic stability with Mixture-of-Experts (MoE) models, the framework continues to use CISPO (Clipped IS-weight Policy Optimization), the algorithm MiniMax introduced in earlier research.
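As a rough illustration of the CISPO idea, the sketch below assumes the formulation from MiniMax's earlier work: the importance-sampling ratio between the new and old policy is clipped and then treated as a constant coefficient on the token's log-probability, so no token is dropped from the update. The function names and the clipping ranges `eps_low` and `eps_high` are illustrative assumptions, not values from the article.

```python
import math
from typing import List


def cispo_token_weights(
    logp_new: List[float],
    logp_old: List[float],
    advantages: List[float],
    eps_low: float = 1.0,   # hypothetical lower clipping range
    eps_high: float = 0.3,  # hypothetical upper clipping range
) -> List[float]:
    """Per-token coefficient on the gradient of log pi(token).

    The importance ratio is clipped to [1 - eps_low, 1 + eps_high] and
    multiplied by the advantage. It is treated as a constant, so every
    token still contributes a gradient through its log-probability.
    """
    weights = []
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)
        clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
        weights.append(clipped * adv)
    return weights


def cispo_surrogate_loss(
    logp_new: List[float],
    logp_old: List[float],
    advantages: List[float],
) -> float:
    """Scalar surrogate: minus the mean of weight * log-prob over tokens."""
    if not logp_new:
        return 0.0
    weights = cispo_token_weights(logp_new, logp_old, advantages)
    return -sum(w * lp for w, lp in zip(weights, logp_new)) / len(logp_new)


if __name__ == "__main__":
    new = [math.log(0.5), math.log(0.2)]
    old = [math.log(0.25), math.log(0.2)]
    print(cispo_token_weights(new, old, advantages=[1.0, -1.0]))
```

In an autodiff framework the clipped ratio would be wrapped in a stop-gradient so that only the log-probability term is differentiated; the plain-Python version above makes that explicit by returning the coefficients directly.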