Eval vs agent: teams favor AI workers

- Anthropic’s widely shared “Building Effective AI Agents” guide became the clearest marker of a shift: teams are shipping workflows and specialist subagents, not free-range agents.
- The key design rule is simple but telling: start with the smallest possible system, and only add autonomy when fixed paths fail.
- That matters because “agent” hype is giving way to operations reality: observability, evals, and audit trails now beat broad autonomy.

AI teams are getting less romantic about agents. That’s the real story here. The flashy idea was a system that could take a goal, choose tools, plan steps, and just go do the work. But the pattern that keeps showing up in actual deployment advice is narrower: break the job into small, legible units, then evaluate each unit hard. Anthropic’s late-2024 guide became the cleanest statement of that view, and newer docs from OpenAI and LangChain mostly point in the same direction.

### What changed?

The change is less a single launch than a consensus hardening in public. Anthropic said the most successful teams were using “simple, composable patterns” rather than complex frameworks. OpenAI’s current agent-building material leans heavily on guardrails, tracing, and orchestration instead of promising that one autonomous loop should run everything. LangChain’s docs make the split explicit: workflows use predetermined code paths, while agents are dynamic and harder to control. (anthropic.com)

### What do people mean by “workers”?

Basically, a worker is a small AI component with one job. Maybe one model classifies an email, another drafts a reply, another checks policy, and a final evaluator decides whether the output is good enough to ship. That is still “agentic” in the broad sense, but it is not one broad, self-directing actor roaming across tools and state. It is closer to a production pipeline with LLM-shaped steps. LangChain describes this as dividing hard problems into tractable units that can be evaluated separately. (anthropic.com)

### Why are teams backing away from full agents?

Because demos lie. A broad agent looks great in a sandbox, but production turns it into a messy distributed system with permissions, retries, hidden costs, and failure modes that are hard to reproduce. OpenAI’s guide frames agents as high-independence systems, but it also spends a lot of time on safe tool design and orchestration.
Anthropic’s advice is even blunter: keep things simple, because complexity compounds fast. (langchain.com)

### Where do evals fit in?

Evals are the reason the “worker” pattern is winning. If one component has one job, you can measure whether it did that job well. You can regression-test it, swap models, compare prompts, and know what broke. Once one agent is planning, calling tools, revising goals, and handing off to other agents, the blame graph gets fuzzy. LangChain’s multi-agent material makes the upside plain: specialized agents can be improved individually without breaking the whole application. (cdn.openai.com)

### Is anyone still building agents?

Yes, but the practical versions are more constrained than the hype implied. OpenAI’s SDK supports handoffs, tracing, and guardrails. Anthropic’s framework includes workflows, routers, and evaluator-optimizer patterns before you get to more autonomous setups. The throughline is not “never use agents.” It is “earn autonomy step by step.” (langchain.com)

### So what’s the actual rule?

Start with the smallest system that can do the job. If a fixed workflow works, use that. If the task is too open-ended for a fixed path, add a little autonomy. If one model cannot hold the whole problem, split it into specialists. That is the emerging operating rule across the best technical guidance, and it is a lot less glamorous than the agent discourse on social media. (developers.openai.com)

### Why does this matter beyond AI teams?

Because this is the difference between a cool prototype and software a company will trust with money, customers, or internal operations. The market is not abandoning agents. It is redefining them into auditable systems with narrower responsibilities, better traces, and clearer tests. The bottom line is simple: teams still want AI to do work, but they increasingly want that work done by supervised specialists, not one autonomous generalist. (anthropic.com) (langchain.com)
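
To make the worker pattern concrete, here is a minimal Python sketch of the email example described earlier: a classifier, a drafter, and a policy checker wired into a fixed workflow, plus a tiny per-worker eval. Everything in it is an assumption for illustration. The function names, labels, templates, and test cases are hypothetical, and simple keyword rules stand in for the model calls a real worker would make. It does not use any vendor SDK.

```python
from dataclasses import dataclass
from typing import Callable


def classify_email(text: str) -> str:
    """Worker 1 (hypothetical): label the email.
    Keyword rules stand in for a real classifier model."""
    lowered = text.lower()
    if "refund" in lowered:
        return "refund_request"
    if "password" in lowered:
        return "account_access"
    return "other"


def draft_reply(label: str) -> str:
    """Worker 2 (hypothetical): draft a reply for a given label.
    Canned templates stand in for a drafting model."""
    templates = {
        "refund_request": "We have received your refund request and will review it.",
        "account_access": "You can reset your password from the login page.",
    }
    return templates.get(label, "Thanks for reaching out. A teammate will follow up.")


def check_policy(reply: str) -> bool:
    """Worker 3 (hypothetical): reject replies containing banned promises."""
    banned = ("guarantee", "immediately refunded")
    return not any(phrase in reply.lower() for phrase in banned)


@dataclass
class Result:
    label: str
    reply: str
    approved: bool


def run_pipeline(email: str) -> Result:
    """Fixed workflow: the steps always run in this predetermined order."""
    label = classify_email(email)
    reply = draft_reply(label)
    approved = check_policy(reply)
    return Result(label, reply, approved)


def evaluate(worker: Callable[[str], object], cases: list[tuple[str, object]]) -> float:
    """Per-worker eval: fraction of cases where the output matches the expectation."""
    passed = sum(1 for given, expected in cases if worker(given) == expected)
    return passed / len(cases)


# Hypothetical regression cases for the classifier alone.
CLASSIFIER_CASES = [
    ("I want a refund for my order", "refund_request"),
    ("I forgot my password", "account_access"),
    ("What are your opening hours?", "other"),
]

if __name__ == "__main__":
    print("classifier accuracy:", evaluate(classify_email, CLASSIFIER_CASES))
    print(run_pipeline("Please refund my last payment"))
```

Because each worker is a plain function with one job, the classifier can be scored, swapped, or re-prompted without touching the drafter or the policy check, which is the property the eval argument above depends on. Adding autonomy later would mean replacing the fixed order in `run_pipeline` with a model-driven choice of next step, and only where the fixed path proves insufficient.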