RLHF Bottleneck: Up to 80% of Compute Goes to Sampling

The biggest bottleneck in RLHF is no longer optimization but data throughput, with up to 80% of compute now spent on sample generation. Labs have to orchestrate several models across GPU clusters just to keep the data flowing, and the human feedback loop has become the primary limiting factor for scaling.
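To see why the sampling share matters more than optimizer speed, the arithmetic can be sketched with an Amdahl's-law style calculation. This is a minimal illustration, not a measurement: the 0.8 sampling fraction is the upper-bound figure quoted above, and the function name `overall_speedup` and the speedup factors are hypothetical values chosen for the example.

```python
def overall_speedup(
    sampling_fraction: float,
    sampling_speedup: float = 1.0,
    optimization_speedup: float = 1.0,
) -> float:
    """Amdahl's-law estimate of end-to-end speedup for a two-phase RLHF step.

    sampling_fraction: share of total compute spent generating samples
        (assumed 0.8 here, the upper bound cited in the article).
    sampling_speedup / optimization_speedup: hypothetical factors by which
        each phase is accelerated.
    """
    if not 0.0 <= sampling_fraction <= 1.0:
        raise ValueError("sampling_fraction must be between 0 and 1")
    if sampling_speedup <= 0 or optimization_speedup <= 0:
        raise ValueError("speedup factors must be positive")

    optimization_fraction = 1.0 - sampling_fraction
    # New total time, relative to an original total of 1.0.
    new_time = (
        sampling_fraction / sampling_speedup
        + optimization_fraction / optimization_speedup
    )
    return 1.0 / new_time


if __name__ == "__main__":
    # Making the optimizer 10x faster barely helps when sampling dominates.
    print(f"{overall_speedup(0.8, optimization_speedup=10.0):.2f}x")  # 1.22x
    # Making sampling only 2x faster helps considerably more.
    print(f"{overall_speedup(0.8, sampling_speedup=2.0):.2f}x")  # 1.67x
```

Under these assumptions, even an infinitely fast optimizer caps the overall gain at 1.25x, which is why effort concentrates on sample generation throughput.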

The heavy compute cost of sampling in RLHF stems from a multi-stage process: supervised fine-tuning (SFT), reward model (RM) training, and proximal policy optimization (PPO). Each stage involves running inference and calculating gradients on large models, but the PPO phase is particularly demanding because the policy model must repeatedly generate responses for the reward model to score.

Alongside the compute cost, the human data bottleneck is forcing a strategic shift toward alternatives like Constitutional AI (CAI), championed by labs like Anthropic. CAI replaces much of the slow and expensive human feedback loop with AI-driven feedback, where a model critiques and revises its own outputs based on a predefined set of principles, or "constitution." This Reinforcement Learning from AI Feedback (RLAIF) approach is designed to be more scalable and consistent than relying on subjective human judgments.

As models become more agentic, meaning capable of multi-step reasoning and tool use, evaluation complexity skyrockets. Labs now rely on sophisticated benchmarks like AgentBench, which tests reasoning across eight different environments, and WebArena, which evaluates agents on their ability to complete complex tasks on realistic, self-hosted websites. These benchmarks create a need for high-quality, task-specific datasets that go far beyond simple preference pairs.

To meet the voracious data needs of both RLHF and agentic systems, labs are increasingly turning to synthetic data. Generative models create artificial, statistically realistic data to address scarcity and privacy issues, with Gartner projecting that by 2030, 60% of all data used for AI will be synthetically generated. However, human expertise remains critical for validating synthetic data and for providing the nuanced, domain-specific feedback that models still require.

The nature of data labeling itself is transforming from low-skill gig work to a high-value service requiring domain experts.
To train models for specialized fields like medicine or law, AI labs now recruit doctors and lawyers to provide the necessary high-context annotations. This shift away from a commodity "assembly line" approach elevates the quality bar and creates opportunities for specialized data providers.

For startups entering this space, the go-to-market (GTM) strategy must be highly targeted. AI-native companies are seeing 35% higher win rates and a 25% reduction in customer acquisition costs by using AI to refine their own GTM. Success requires a deep understanding of the technical buyer and a clear value proposition focused on delivering high-quality, niche data that can solve specific alignment or evaluation challenges.

The fundraising environment for AI infrastructure remains exceptionally strong, with investors concentrating capital into fewer, more promising companies. In the first two months of 2026 alone, 17 U.S.-based AI startups secured funding rounds exceeding $100 million each. Venture capital is flowing to companies that provide the essential "picks and shovels" for the AI gold rush, including scalable data and compute platforms.

This evolution signals a broader change in the future of work, where AI automates repetitive tasks while creating demand for human experts to supervise, validate, and refine AI systems. The data labeling market is projected to reach $8 billion by 2028, driven by the need for human-in-the-loop pipelines and specialized data services to build more capable and trustworthy AI.
