Labs Pursue 'Modular Intelligence' for Agents

AI21 Labs and Google DeepMind are converging on a 'modular intelligence' approach to agent orchestration. In this human-like model, agents coordinate, plan, and delegate tasks among themselves, which creates a need for dynamic, scenario-based evaluation that goes beyond traditional static benchmarks.

AI21 Labs’ modular approach, known as MRKL (Modular Reasoning, Knowledge and Language), combines large language models with external, discrete reasoning modules such as calculators or databases. This neuro-symbolic architecture lets an AI system route each task to the best tool for the job, with the aim of reducing the factual errors and hallucinations common in monolithic LLMs. The company's "Maestro" orchestration system reportedly halves hallucination rates in enterprise use cases.

Google DeepMind is also developing agentic AI that can discover and optimize complex algorithms, a system it calls AlphaEvolve. Used internally, AlphaEvolve has already been applied to improve the efficiency of Google's data centers and to speed up the FlashAttention kernel for transformer models by a claimed 32.5%. This reflects a broader industry shift from building general models to creating specialized, synthetic experts that acquire verifiable skills.

Evaluating these sophisticated, multi-step agents requires moving beyond static benchmarks like MMLU. A new generation of "stateful" benchmarks, including WebArena, AgentBench, and GAIA, tests agents on their ability to perform tasks in interactive environments such as web browsers, databases, and operating systems. These benchmarks assess an agent's reasoning and tool use in dynamic scenarios, which is crucial for real-world deployment.

Training these agents often relies on Reinforcement Learning from Human Feedback (RLHF), but the cost and inconsistency of human supervision at scale are a major bottleneck. This has led to the development of Constitutional AI, an approach pioneered by Anthropic in which an AI critiques its own outputs against a set of principles, reducing the dependency on human raters for every single task. This method, which relies on Reinforcement Learning from AI Feedback (RLAIF), allows for more scalable and transparent alignment.

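
The MRKL-style routing described above, in which a task is sent to a calculator, a database, or the language model, can be illustrated with a minimal Python sketch. Everything here is hypothetical and is not AI21's implementation: the rule-based `route` function stands in for what would in practice be a learned router, `CAPITALS` is a toy stand-in for an external database, and `fallback_llm` is a placeholder rather than a real model call.

```python
import ast
import operator
import re

# Hypothetical toy "database" module; a real system would query an external store.
CAPITALS = {"france": "Paris", "japan": "Tokyo"}

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str) -> str:
    """Evaluate +, -, *, / arithmetic by walking the parsed syntax tree (no eval)."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expression, mode="eval").body))


def capital_lookup(country: str) -> str:
    """Look up a capital in the toy knowledge base."""
    return CAPITALS.get(country.strip().lower(), "unknown")


def fallback_llm(query: str) -> str:
    """Placeholder for a language model call (assumption: no real model is invoked)."""
    return f"[LLM would answer: {query}]"


ARITHMETIC = re.compile(r"^[\d\s.+\-*/()]+$")
CAPITAL_Q = re.compile(r"^what is the capital of (.+?)\??$", re.IGNORECASE)


def route(query: str) -> tuple[str, str]:
    """Return (module_name, answer), preferring a discrete module over the LLM."""
    q = query.strip()
    if ARITHMETIC.match(q):
        try:
            return "calculator", calculator(q)
        except (SyntaxError, ValueError, ZeroDivisionError):
            pass  # not valid arithmetic after all; fall through to the LLM
    match = CAPITAL_Q.match(q)
    if match:
        return "database", capital_lookup(match.group(1))
    return "llm", fallback_llm(q)


print(route("12 * (3 + 4)"))                    # ('calculator', '84')
print(route("What is the capital of France?"))  # ('database', 'Paris')
```

The point of the design is that arithmetic and lookups are answered by deterministic modules, so the language model is consulted only when no discrete tool applies.
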
The decision between using synthetic or human-labeled data is now a core strategic choice for AI labs. Synthetic data offers speed and scalability, but it is generally limited by the quality of its "teacher" model and often lacks the noise and unpredictability of real-world data. Research suggests that while up to 90% of a training set can be synthetic, the final 10% of human-labeled data is often indispensable to prevent significant performance declines, especially for context-sensitive tasks.

For AI infrastructure startups, this creates a specific go-to-market (GTM) challenge: selling to highly technical buyers. Success requires a GTM strategy that focuses on metrics like "Return on AI Investment" (ROAI) and demonstrates tangible ROI for technology often seen as experimental. Companies using AI in their GTM strategies report 35% higher win rates and a 25% reduction in customer acquisition costs.

The rise of agentic AI is also reshaping the workforce, with new roles emerging to oversee AI operations, governance, and risk. IDC predicts that by 2027, half of all AI-enabled enterprise applications will require new oversight positions. The shift reframes AI from a "co-worker" into a sophisticated tool, one that needs human systems to provide accountability and to turn automation into a competitive advantage.
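
To make the synthetic-to-human ratio discussed above concrete, the following Python sketch assembles a training set with a fixed synthetic share. It is an illustration under stated assumptions only: the function name `mix_training_set`, its parameters, and the example data are all hypothetical, and the 0.9 default simply mirrors the 90%/10% figure quoted in the briefing.

```python
import random


def mix_training_set(synthetic, human, total, synthetic_fraction=0.9, seed=0):
    """Sample `total` examples so that `synthetic_fraction` of them are synthetic.

    The remainder is drawn from the human-labeled pool. Sampling is without
    replacement, so both pools must be large enough.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    n_synth = round(total * synthetic_fraction)
    n_human = total - n_synth
    if n_synth > len(synthetic) or n_human > len(human):
        raise ValueError("not enough examples in one of the pools")
    rng = random.Random(seed)  # seeded for reproducible mixes
    mixed = rng.sample(synthetic, n_synth) + rng.sample(human, n_human)
    rng.shuffle(mixed)
    return mixed


# Hypothetical pools: each example is tagged with its origin.
synthetic_pool = [("synthetic", i) for i in range(500)]
human_pool = [("human", i) for i in range(50)]

mixed = mix_training_set(synthetic_pool, human_pool, total=100)
n_human = sum(1 for origin, _ in mixed if origin == "human")
print(len(mixed), n_human)  # 100 10
```

Treating the synthetic share as an explicit parameter makes it straightforward to rerun an evaluation at different ratios and check where performance starts to decline.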
