RLHF shifts to trajectory-level feedback

The technical requirements for Reinforcement Learning from Human Feedback (RLHF) data are evolving beyond single-turn output rankings. AI labs now increasingly require nuanced human feedback across entire agent trajectories. This includes evaluating multi-step reasoning, tool use, and context-aware decision-making, as demonstrated in recent research applying reinforcement learning to complex, iterative tasks.
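To make the contrast with single-turn rankings concrete, here is a minimal sketch of what a trajectory-level feedback record might look like. The schema is an assumption for illustration only: the class names (`Step`, `TrajectoryFeedback`), the step kinds, and the -1/0/+1 rating scale are hypothetical and do not come from any particular lab's pipeline.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One step in an agent trajectory (hypothetical schema)."""
    kind: str                      # e.g. "reasoning", "tool_call", "final_answer"
    content: str
    rating: Optional[int] = None   # assumed scale: -1 (bad), 0 (neutral), +1 (good)
    note: str = ""                 # free-text annotator comment


@dataclass
class TrajectoryFeedback:
    """Human feedback attached to a whole trajectory, not only its last output."""
    task: str
    steps: List[Step] = field(default_factory=list)
    overall_rating: Optional[int] = None  # holistic judgment of the full workflow

    def mean_step_rating(self) -> Optional[float]:
        """Average over the steps that received a rating; None if none were rated."""
        rated = [s.rating for s in self.steps if s.rating is not None]
        return sum(rated) / len(rated) if rated else None

    def first_error_index(self) -> Optional[int]:
        """Index of the first step an annotator rated negatively, if any."""
        for i, s in enumerate(self.steps):
            if s.rating is not None and s.rating < 0:
                return i
        return None


# Illustrative record; the task and tool call are made up for this example.
feedback = TrajectoryFeedback(
    task="Find the cheapest flight and summarize the options",
    steps=[
        Step("reasoning", "Plan: search flights, then compare prices.", rating=1),
        Step("tool_call", "search_flights(origin='SFO', dest='JFK')", rating=1),
        Step("reasoning", "Picked the first result without comparing.", rating=-1,
             note="Skipped the comparison step."),
        Step("final_answer", "The cheapest flight is ...", rating=0),
    ],
    overall_rating=0,
)

print(feedback.mean_step_rating())   # 0.25
print(feedback.first_error_index())  # 2
```

The point of the per-step ratings is that an annotator can mark where a workflow went wrong (here, the third step), which a single ranking of the final answer cannot express.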

- Trajectory-level feedback is crucial for training agentic AI systems, as evaluation moves beyond single outputs to assessing multi-step reasoning, tool use, and error recovery across entire workflows. New benchmarks like TurnBench and ProcBench are being developed to specifically measure these multi-turn, multi-step reasoning capabilities where current models still struggle.
- Anthropic's Constitutional AI is a key technique in this shift, using a predefined set of principles to enable the model to critique and revise its own outputs, reducing the reliance on constant human feedback for every decision. This self-correction method is designed to make the AI more helpful, harmless, and honest without the scalability bottlenecks of traditional RLHF.
- The demand for human data is shifting from quantity to quality, with a focus on domain experts (such as lawyers, doctors, and software developers) who can provide nuanced feedback on complex tasks. This move away from gig-worker-style data labeling addresses the need for high-context annotations required by frontier models.
- While human feedback is essential for defining tasks and evaluating nuanced performance, synthetic data is increasingly used for scaling up training datasets, especially where models exceed human reliability. The "LLM-as-a-Judge" method, where a powerful model evaluates the outputs of another, is a common application of synthetic feedback for scaling evaluations.
- The data labeling workforce is evolving from entry-level task execution to more specialized roles like quality control analyst and AI trainer, requiring investment in upskilling and creating defined career paths. However, this global workforce, estimated to be between 150 and 430 million people, often faces poor working conditions, highlighting the need for fair labor practices.
- For AI infrastructure startups, the go-to-market strategy must focus on demonstrating how their data solutions solve specific revenue-impacting problems for technical buyers, rather than just offering a general AI tool. Successful strategies often pinpoint and automate repetitive, low-value tasks within a company's existing workflows to show immediate value.
- The fundraising climate for AI companies remains strong, with AI-focused startups capturing nearly half of all global venture funding in 2025. AI infrastructure is a significant area of investment, with median Series B valuations for AI startups reaching $143 million, indicating strong investor confidence in the foundational layers of the AI stack.
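The critique-and-revise idea mentioned in the Constitutional AI point above can be sketched as a simple loop. This is a toy illustration under stated assumptions, not Anthropic's implementation: the two principles, the function names, and the keyword-matching stand-ins for model calls are all hypothetical. In a real system, `critique` and `revise` would each be a call to a language model.

```python
from typing import Callable, List

# Hypothetical principles; a real constitution is longer and more nuanced.
PRINCIPLES = [
    "Do not include insults.",
    "Do not reveal passwords.",
]


def critique_and_revise(
    draft: str,
    principles: List[str],
    critique: Callable[[str, str], str],
    revise: Callable[[str, str], str],
) -> str:
    """Apply each principle in turn: critique the current text, revise if a problem is found."""
    text = draft
    for principle in principles:
        problem = critique(text, principle)  # empty string means no violation found
        if problem:
            text = revise(text, problem)
    return text


# Toy stand-ins for model calls, so the sketch runs without any API.
def toy_critique(text: str, principle: str) -> str:
    """Return the offending word if the text violates the principle, else ''."""
    banned = {
        "Do not include insults.": "idiot",
        "Do not reveal passwords.": "hunter2",
    }
    word = banned.get(principle, "")
    return word if word and word in text else ""


def toy_revise(text: str, problem: str) -> str:
    """Remove the flagged word; a real reviser would rewrite the passage."""
    return text.replace(problem, "[removed]")


revised = critique_and_revise(
    "You idiot, the password is hunter2.", PRINCIPLES, toy_critique, toy_revise
)
print(revised)  # You [removed], the password is [removed].
```

No human is consulted inside the loop; the human contribution is the list of principles, which is what reduces the need for per-decision feedback.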
