A Backlash Against RLHF Is Brewing

Prominent AI researcher Alan Mathison is sparking debate by arguing that Reinforcement Learning from Human Feedback (RLHF) is "badly harming the models." In another post, he linked the technique to AI "suffering," tapping into growing concerns among alignment teams about the unintended consequences of post-training methods on model behavior.

RLHF is a multi-stage process that begins with a pre-trained model, which is then fine-tuned using a high-quality dataset created by human experts. Following this, a separate "reward model" is trained by having human annotators rank different model outputs, teaching the AI to predict human preferences. Finally, the language model is optimized to generate responses that maximize the predicted human satisfaction score from the reward model.

The need for high-quality, large-scale human feedback makes RLHF costly and difficult to scale. This has led to the development of alternatives like Constitutional AI, pioneered by Anthropic. Constitutional AI trains models to critique and correct their own outputs based on a predefined set of ethical principles, reducing the reliance on constant human supervision. Other emerging techniques include Direct Preference Optimization (DPO), which simplifies the process by directly optimizing the language model against preference data, and Reinforcement Learning from AI Feedback (RLAIF), which uses an AI model to generate preference labels.

Data quality is a critical factor in the success of any AI training pipeline, with poor data being a common reason for project failure. Key dimensions of data quality include accuracy, completeness, consistency, and the absence of bias. To maintain this quality, labs employ continuous data monitoring and validation, often using automated tools to detect anomalies and ensure data integrity throughout the model's lifecycle.

As AI systems become more autonomous, evaluating their performance requires new benchmarks that go beyond simple accuracy. For these "agentic" AIs, benchmarks like AgentBench and WebArena test their ability to perform multi-step tasks, use tools, and navigate complex environments. These evaluations measure task success rates, efficiency in terms of cost and speed, and the accuracy of actions taken.
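For readers who want to see what "training on ranked outputs" can look like in practice, here is a minimal sketch of the two preference objectives mentioned in the RLHF overview above: a pairwise reward-model loss and the DPO loss. It assumes the commonly used Bradley-Terry formulation for the reward model; the article does not say which formulation any particular lab uses. The function names, the scalar inputs, and the default `beta` value are illustrative assumptions, and real pipelines compute these quantities over batches of model outputs rather than single numbers.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise (Bradley-Terry style) loss for one ranked pair.

    r_chosen / r_rejected are the reward model's scalar scores for the
    response the annotator preferred and the one they rejected.
    The loss shrinks as the preferred response is scored higher.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed illustrative value, not from the article
) -> float:
    """DPO loss for one preference pair.

    Inputs are log-probabilities of each full response under the model
    being trained (policy) and under a frozen reference model. No
    separate reward model is needed.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


if __name__ == "__main__":
    # Hypothetical scores: the reward model already prefers the chosen answer.
    print(reward_model_loss(r_chosen=2.0, r_rejected=0.5))
    # Hypothetical log-probs: the policy favors the chosen answer more than
    # the reference model does.
    print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

Both losses have the same shape, a negative log-sigmoid of a margin between the preferred and rejected response. The difference is where the margin comes from: a learned reward score in the first case, and the policy's own log-probabilities relative to a reference model in the second, which is why DPO can skip the separate reward-model stage.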
To address the bottleneck of sourcing human-labeled data, many AI labs are turning to synthetic data generation. This involves using AI to create artificial datasets that mirror the statistical properties of real-world data, which is particularly useful for scenarios where real data is scarce or sensitive. This approach can accelerate development timelines and reduce data acquisition costs significantly.

The demand for high-quality, specialized data is shifting the data labeling workforce away from low-skill gig work towards domain experts. Instead of simple image tagging, the focus is now on recruiting professionals like doctors, lawyers, and coders who can provide nuanced, context-rich annotations for training frontier models. This evolution highlights the growing need for a skilled workforce to handle complex data requirements.

For AI infrastructure startups, the go-to-market strategy is shifting from selling tools to selling transformation. The sales process for technical products often involves educating a complex buying committee that includes stakeholders from data science, legal, and finance. Successful strategies focus on demonstrating tangible business outcomes and building a strong business case for adoption.

The fundraising climate for AI companies has been exceptionally strong, with AI-related startups attracting a significant portion of global venture funding. In 2024, AI startups raised over $100 billion, with a notable portion of this investment flowing into AI infrastructure and data provisioning companies. This trend includes higher valuations at all funding stages compared to non-AI startups, indicating strong investor confidence in the sector.
