New Benchmarks Emerge for Agentic AI
Several new benchmarks are being introduced to evaluate complex, multi-step AI agents. Jenova.ai released a benchmark for long-context agentic orchestration, testing models on workflows with over 100,000 context tokens. Separately, the micro1 Cortex platform was launched for contextual testing in enterprise workflows, while OpenAI and Paradigm are testing agents against crypto exploits with EVMbench.
- The EVMbench benchmark, developed by OpenAI and Paradigm, is built from 120 high-severity vulnerabilities discovered across 40 different smart contract audits. It tests AI agents in three distinct modes: "detect" for identifying flaws, "patch" for fixing them without breaking functionality, and "exploit" for attempting to drain funds in a secure sandbox environment. In initial tests, OpenAI's GPT-5.3-Codex model performed significantly better in the "exploit" mode than earlier models.
- Reinforcement Learning from Human Feedback (RLHF) is a key process for aligning models, but it is resource-intensive, often requiring tens of thousands of human preference labels to fine-tune a single model. The workflow typically has four steps: start from a pre-trained model, collect human preference data on its outputs, train a "reward model" on that feedback, and then optimize the original model's policy to maximize the predicted reward. Although RLHF reduces the need for massive, fully labeled datasets, the quality and consistency of the human labelers remain critical bottlenecks.
- Constitutional AI, an approach developed by researchers at Anthropic, aims to reduce the reliance on extensive human labeling for safety by providing the AI with a set of guiding principles or a "constitution". The model is trained to critique and revise its own outputs based on these rules, automating the feedback loop that is manually driven in traditional RLHF workflows.
- A primary bottleneck in developing more capable AI is the availability of high-quality training data, as the growth in computational power has outpaced the creation of new, diverse datasets. This has left models "over-parameterized": they memorize patterns from exhausted high-quality sources rather than generalizing, which forces labs to turn to synthetic data generation to supplement training. However, an overreliance on synthetic data can produce models that perform poorly in real-world scenarios.
- The micro1 Cortex platform is designed to evaluate enterprise AI agents by testing them within the context of real organizational workflows and internal data pipelines, moving beyond generic benchmarks. The platform sources and manages domain experts in fields like finance, HR, and legal to create evaluation data that reflects actual customer use cases for companies like Box.
- While the venture capital landscape has become more cautious, AI infrastructure startups are attracting significant investment, with AI companies raising a third of all venture capital in 2024. Seed-stage AI startups saw median valuations 42% higher than their non-AI counterparts, and at the Series B stage, the median valuation for an AI startup was $143 million, 50% higher than for non-AI companies.
- Go-to-market strategies for AI startups are shifting from a "growth at all costs" mindset to a focus on capital efficiency and demonstrating measurable ROI. Successful strategies now often involve performance-based pricing, using AI for hyper-personalization in marketing, and building trust by delivering verifiable value before a contract is signed.
- The demand for data labelers is evolving from low-skilled gig work, such as labeling images for autonomous vehicles, to requiring high-context, domain-specific expertise from professionals like lawyers, doctors, and financial analysts. This shift is driven by the need to provide nuanced feedback for frontier models, with the largest AI labs now spending $1-2 billion annually on human-in-the-loop data pipelines.
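
The reward-model step of the RLHF workflow described in the list above can be made concrete with a small sketch. It assumes a Bradley-Terry pairwise preference loss, which is a common choice but not one the sources above specify, and it replaces the neural reward model with a linear scorer over hand-made feature vectors. The names `PREFERENCES`, `reward`, `pairwise_loss`, and `train_reward_model` are hypothetical and exist only for illustration.

```python
import math

# Hypothetical toy data: each pair is
# (features of the output a labeler preferred, features of the rejected output).
PREFERENCES = [
    ([1.0, 0.2], [0.0, 0.3]),
    ([0.9, 0.8], [0.2, 0.7]),
    ([0.8, 0.1], [0.1, 0.9]),
]


def reward(weights, features):
    """Linear stand-in for a reward model: a weighted sum of features."""
    return sum(w * f for w, f in zip(weights, features))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def pairwise_loss(weights, chosen, rejected):
    """Bradley-Terry loss: -log(sigmoid(r(chosen) - r(rejected)))."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    # Equivalent to softplus(-margin), written to avoid overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_loss(weights, preferences):
    """Average pairwise loss over all labeled preference pairs."""
    return sum(pairwise_loss(weights, c, r) for c, r in preferences) / len(preferences)


def train_reward_model(preferences, learning_rate=0.5, epochs=200):
    """Fit the linear reward weights with plain stochastic gradient descent."""
    dim = len(preferences[0][0])
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in preferences:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient of the loss w.r.t. weights is -(1 - sigmoid(margin)) * (chosen - rejected).
            scale = 1.0 - sigmoid(margin)
            for i in range(dim):
                weights[i] += learning_rate * scale * (chosen[i] - rejected[i])
    return weights


weights = train_reward_model(PREFERENCES)
print("learned weights:", weights)
print("mean pairwise loss:", mean_loss(weights, PREFERENCES))
```

In a full pipeline the final step would optimize the policy against this learned reward, for example with a policy-gradient method; that step is omitted here. The sketch also shows why label quality matters: the reward model can only learn the ordering the labelers supplied, so inconsistent preference pairs directly distort the signal the policy is later trained to maximize.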