MiniMax Details In-House Expert RLHF Workflow

Olive Song of MiniMax revealed that the company uses its own expert developers as reward models for its M2 agent, creating tight feedback loops for coding tasks. The training process interleaves model actions with environmental perturbations, such as shuffling tool access, to improve generalization. The team also found that reinforcement learning required higher-precision FP32 arithmetic to avoid reward hacking, underscoring the need for engineering rigor in model development.
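
To make the idea of perturbing tool access concrete, here is a minimal sketch of one way an episode's tool list could be randomized. It is not MiniMax's implementation, which the summary does not describe: the function name `perturb_tools`, the `drop_prob` parameter, and the choice to both reorder and occasionally withhold tools are all assumptions made for illustration.

```python
import random


def perturb_tools(tools, rng, drop_prob=0.2):
    """Return a randomized view of `tools` for one training episode.

    Hypothetical helper: each tool is withheld with probability `drop_prob`,
    at least one tool is always kept, and the survivors are shuffled.
    The input list is not modified.
    """
    if not tools:
        raise ValueError("tools must be non-empty")
    kept = [t for t in tools if rng.random() >= drop_prob]
    if not kept:
        # Never hand the agent an empty toolbox.
        kept = [rng.choice(tools)]
    rng.shuffle(kept)
    return kept


# Example: sample a different tool view per episode from a seeded generator.
rng = random.Random(0)
episode_tools = perturb_tools(["search", "shell", "editor", "tests"], rng)
```

Passing an explicit `random.Random` instance keeps the perturbation reproducible across runs, which helps when comparing training configurations.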

- Anthropic's Constitutional AI offers an alternative to pure RLHF by first having a model critique and revise its own outputs based on a set of principles, or a "constitution." This creates a preference dataset from AI-generated feedback (RLAIF) to train a model that is helpful and harmless, reducing the reliance on costly and potentially inconsistent human labeling.
- Evaluating agentic AI systems requires specialized benchmarks beyond traditional language model metrics. Frameworks like AgentBench, WebArena, and SWE-bench test agents on their ability to perform multi-step tasks, use tools, and solve real-world software engineering problems, creating a need for high-quality, task-specific evaluation data.
- The cost and scalability of sourcing high-quality human feedback is a significant bottleneck for AI labs. Challenges include annotator subjectivity, fatigue, the need for deep domain expertise for complex tasks, and the potential for introducing biases, which can degrade model performance.
- Newer alignment techniques are moving towards more automated and efficient methods. Direct Preference Optimization (DPO) bypasses the need for a separate reward model entirely by directly optimizing the language model on preference pairs. Unsupervised preference alignment is an emerging research area that aims to create preference data without human or AI supervision, for instance by using augmented data to generate "hard negative" examples.
- Venture capital funding for AI startups surged in 2025, capturing nearly 50% of all global venture investments. A significant portion of this capital flowed to AI infrastructure companies, including foundation models and developer tools, with investors making fewer but larger bets on scalable technologies.
- A go-to-market strategy for an AI infrastructure startup must focus on a well-defined Ideal Customer Profile (ICP), such as ML engineers or researchers at specific labs. The strategy should detail a clear value proposition, distribution channels, and a sales model designed to engage highly technical buyers, often requiring a deep understanding of their workflows and pain points.
- The rise of sophisticated AI systems is creating new categories of human-in-the-loop work. While some data labeling tasks may be automated, the need for expert-level feedback, red-teaming, and evaluation on complex, multi-step agentic tasks is growing, shifting the nature of the data annotation workforce toward more specialized skills.
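
As a rough illustration of how DPO optimizes directly on preference pairs without a separate reward model, the sketch below computes the standard per-pair DPO loss from summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model. The function and argument names are illustrative, and `beta=0.1` is an assumed default rather than a recommended setting.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio))).

    Each argument is the summed log-probability of a full response.
    """
    # How much more (in log space) the policy likes each response
    # than the frozen reference model does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss shrinks as the policy favors the chosen response
# more strongly than the reference model does.
loss = dpo_loss(-12.0, -20.0, -14.0, -15.0)
```

When the policy and reference agree exactly, the margin is zero and the loss equals log 2; training pushes it below that value by widening the gap in favor of the chosen response.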
