Agent Evaluation Demands 'Trajectory-Level' Data

Evaluating AI agents now requires analyzing the entire task pathway, not just the final result. Labs are shifting to "trajectory-level" annotation, which involves flagging every decision, tool call, and revision an agent makes. This complex, multi-step feedback is essential for testing an agent's ability to decompose tasks and recover from errors.
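
To make the idea concrete, here is a minimal Python sketch of what a trajectory-level annotation record might look like. It is an illustration only: the `Step` and `Trajectory` classes, their field names, and the `summarize` helper are all hypothetical and do not come from any particular lab or benchmark. It also assumes a deliberately simple definition of "recovery": a flagged step counts as recovered if a later revision is judged correct.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    """One agent action plus an annotator's judgment (hypothetical schema)."""
    kind: str          # "decision", "tool_call", or "revision"
    description: str   # what the agent did at this step
    correct: bool      # annotator flag: was this step appropriate?
    note: str = ""     # optional free-text comment from the annotator


@dataclass
class Trajectory:
    """A full task attempt: every annotated step plus the final outcome."""
    task: str
    steps: List[Step] = field(default_factory=list)
    succeeded: bool = False  # the outcome-only label older evaluations relied on


def summarize(traj: Trajectory) -> Dict[str, object]:
    """Compute simple process-level metrics alongside the final outcome."""
    error_idx = [i for i, s in enumerate(traj.steps) if not s.correct]
    # Assumed definition: an error is "recovered" if any later step is a
    # revision that the annotator marked as correct.
    recovered = sum(
        1
        for i in error_idx
        if any(s.kind == "revision" and s.correct for s in traj.steps[i + 1:])
    )
    total = len(traj.steps)
    return {
        "succeeded": traj.succeeded,
        "num_steps": total,
        "errors": len(error_idx),
        "recovered_errors": recovered,
        "step_accuracy": (total - len(error_idx)) / total if total else 0.0,
    }


# A made-up example trajectory with one mistake that the agent later corrects.
example = Trajectory(
    task="Book a flight",
    steps=[
        Step("decision", "Split the task into search and checkout", True),
        Step("tool_call", "Search flights with the wrong date", False, "wrong date"),
        Step("revision", "Re-run the search with the corrected date", True),
        Step("tool_call", "Complete checkout", True),
    ],
    succeeded=True,
)

summary = summarize(example)
print(summary)
```

An outcome-only label would record this attempt simply as a success. The per-step record also shows that the agent made one error and recovered from it, which is the kind of signal trajectory-level annotation is meant to capture.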

The shift to trajectory-level evaluation is producing benchmarks that look beyond final outcomes. Rather than measuring only task success, benchmarks such as WebArena, AgentBench, and GAIA assess an agent's performance across multi-step web browsing, software development, and operating system tasks. These environments call for evaluating the entire process: the efficiency of the actions taken, the ability to recover from errors, and the safety of each decision.

This detailed feedback is a core component of Reinforcement Learning from Human Feedback (RLHF), a technique used to align models with human intent. However, RLHF is limited by the subjectivity, inconsistency, and potential biases of human annotators, and it is difficult to scale. The high cost and logistical complexity of sourcing and managing a large, diverse group of human labelers are a significant operational bottleneck for AI labs.

To address these scalability issues, some labs are turning to Reinforcement Learning from AI Feedback (RLAIF). In this process, a separate AI model, guided by a "constitution" or set of rules, generates the preference data, which can be faster and cheaper than using human annotators. While RLHF is often considered the gold standard for grounding models in human values, RLAIF offers a way to accelerate training and expand feedback at scale, and many see a hybrid approach as the most likely future.

Synthetic data is also emerging as a critical tool for training and testing agents, especially when real-world data is scarce, sensitive, or expensive to label. Generative models can create vast datasets covering diverse scenarios, edge cases, and even adversarial examples that test agent robustness. A key challenge is ensuring that the generated data accurately preserves the statistical properties and structural relationships of real data, which requires careful validation.

For data labeling startups, this evolution means moving beyond simple annotation. Demand is shifting from low-skill gig workers to high-context, domain-specific experts such as doctors, lawyers, and coders who can provide nuanced feedback on complex tasks. This creates an opportunity for businesses that can recruit, train, and manage specialized talent, turning data labeling into a more strategic function within the AI development lifecycle.

The fundraising climate for AI infrastructure startups reflects this intense demand, with venture capital investment in AI reaching over $100 billion in 2024, a significant increase from the previous year. A large portion of this funding is directed toward infrastructure and data provisioning companies that support AI operations. Investors are placing larger, more concentrated bets on companies that provide the foundational tools and data needed to build and align advanced AI systems.
