Stop vibe‑testing agent evals
- OpenAI, Anthropic, LangChain, Google, and Microsoft are converging on the same message in 2025 and 2026: stop “vibe testing” agents and start running fixed eval suites.
- The shared pattern is specific: score traces, tool choice, argument correctness, final answers, latency, cost, and safety separately, then run them in CI.
- That matters because agents are multi-step systems now, and sampling a few chats misses regressions that only show up in trajectories.

Agent evals are having their “unit tests, not demos” moment. The shift is simple, but it matters a lot: if you judge an agent by poking it a few times and seeing whether it feels smart, you will miss the bugs that actually break production. Over the last year, the big platform and tooling teams have started saying that part out loud. OpenAI, Anthropic, LangChain, Google, and now Microsoft are all pushing some version of the same idea: fixed datasets, explicit graders, trace inspection, and CI instead of vibes.

### What’s wrong with vibe testing?

Vibe testing is what most teams do first. They run ten prompts, watch the agent do something cool, and decide it’s good enough. But agents fail in ways that are hard to spot casually. A final answer can look fine while the tool path was wasteful, unsafe, or fragile. And because agents act across many turns and modify state, one small mistake early can compound into a bad outcome later. That is exactly why Anthropic and LangChain both frame agent evals around trajectories, not just outputs. (developers.openai.com)

### Why are agents harder than plain chatbots?

A chatbot mostly gives you one answer. An agent chooses tools, formats arguments, reads results, updates its plan, and then answers. That means there are several places to fail. The model can pick the wrong tool. It can call the right tool with bad arguments. It can interpret the tool output badly. Or it can get the answer right in a ridiculously expensive or slow way. OpenAI’s agent-evals docs basically treat these as separate surfaces to measure, which is the right mental model. (anthropic.com)

### So what should you score?

The common recipe is becoming pretty clear. Score the final answer, yes, but also score the run itself. Was the tool selection correct? Were the arguments well formed? Did the agent follow the expected path? How long did it take? How much did it cost? Did it refuse when it should have refused?
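
Those questions can be sketched as one check per dimension. This is a minimal illustration, not any vendor's API: the `Run` and `Case` schemas, the `score_run` function, the tool names, and the latency and cost thresholds are all hypothetical, made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Run:
    """One recorded agent run (hypothetical schema, not a real trace format)."""
    tool_calls: list
    final_answer: str
    latency_s: float
    cost_usd: float
    refused: bool = False


@dataclass
class Case:
    """Expected behaviour for one task in the eval dataset (hypothetical)."""
    expected_tools: list       # tool names, in the expected order
    required_args: dict        # tool name -> set of required argument keys
    answer_must_contain: str
    should_refuse: bool = False
    max_latency_s: float = 10.0   # assumed budget
    max_cost_usd: float = 0.05    # assumed budget


def score_run(run, case):
    """Return one pass/fail result per dimension instead of a blended score."""
    called = [c.name for c in run.tool_calls]
    args_ok = all(
        case.required_args.get(c.name, set()) <= set(c.args)
        for c in run.tool_calls
    )
    return {
        "tool_selection": set(called) == set(case.expected_tools),
        "arguments": args_ok,
        "trajectory": called == case.expected_tools,
        "final_answer": case.answer_must_contain.lower() in run.final_answer.lower(),
        "latency": run.latency_s <= case.max_latency_s,
        "cost": run.cost_usd <= case.max_cost_usd,
        "refusal": run.refused == case.should_refuse,
    }


case = Case(
    expected_tools=["search_orders", "issue_refund"],
    required_args={
        "search_orders": {"customer_id"},
        "issue_refund": {"order_id", "amount"},
    },
    answer_must_contain="refund",
)
run = Run(
    tool_calls=[
        ToolCall("search_orders", {"customer_id": "c-1"}),
        ToolCall("issue_refund", {"order_id": "o-9"}),  # "amount" is missing
    ],
    final_answer="Your refund has been issued.",
    latency_s=3.2,
    cost_usd=0.01,
)
scores = score_run(run, case)
failed = [name for name, ok in scores.items() if not ok]
print(failed)  # ['arguments']
```

The final answer here looks fine, and only the `arguments` check fails. That is the kind of bug a casual chat would miss. A real suite would likely swap the substring match for a rubric or model grader; the substring check is only a stand-in.
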
LangChain’s docs call out final response and trajectory evals directly. Google’s multi-agent eval lab adds tool-use quality. Microsoft’s new Copilot Agent Evaluations CLI emphasizes structured reports for development loops and CI/CD. (developers.openai.com)

### Why split those scores apart?

Because one blended score hides the bug. If quality drops, you need to know whether retrieval got worse, tool routing drifted, or the model started over-calling expensive tools. Anthropic makes a similar point in its post on infrastructure noise: the same agent can score differently depending on resource configuration, so collapsing everything into one number makes comparisons hard to interpret. Separate metrics make failures traceable. (docs.langchain.com)

### What does “fixed eval suite” really mean?

It means you stop changing the test every time you want to feel better. You build a representative dataset of tasks that match real usage, define success criteria, and run the same checks over time. Then model swaps, prompt edits, tool changes, and orchestration tweaks all get measured against the same baseline. OpenAI’s eval materials describe this as captured runs plus checks you can compare over time. That is basically regression testing for agent behavior. (anthropic.com)

### Where does tracing fit in?

Tracing is the bridge between “the score went down” and “here’s why.” If an eval says tool-use quality slipped, traces show the exact turn where the agent picked the wrong function or malformed an argument. OpenAI’s agent workflow docs say to start with traces when you are still debugging behavior. Langfuse and LangSmith make the same bet: record the internal steps, then grade them offline or online. (developers.openai.com)

### Does this kill human judgment?

No. It just moves human judgment to where it helps. Humans are still useful for defining the dataset, writing rubrics, reviewing weird failures, and checking whether the eval still matches the product.
But the day-to-day gate should be repeatable. Otherwise every launch decision becomes “I tried it and it seemed okay,” which is not a real quality bar. (developers.openai.com)

### What’s the bottom line?

The industry is settling on a grown-up way to test agents. Don’t demo them into production. Treat them like multi-step software systems, with traces, failure categories, fixed eval suites, and CI gates. The vibe test still has one job: generating hypotheses. It should not be the scoreboard. (langchain.com)
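
As one concrete reading of “fixed eval suites and CI gates,” here is a minimal sketch of a regression gate. It assumes each eval case has already produced a dict of per-metric pass/fail results, and that a baseline of per-metric pass rates is stored with the repo. The function names, metric names, baseline numbers, and the 0.02 tolerance are all hypothetical.

```python
def pass_rates(results):
    """Per-metric pass rate over a list of per-case score dicts."""
    metrics = results[0].keys()
    return {m: sum(r[m] for r in results) / len(results) for m in metrics}


def regressions(current, baseline, tolerance=0.02):
    """Metrics whose pass rate fell more than `tolerance` below the baseline."""
    return {
        m: (baseline[m], current[m])
        for m in baseline
        if current.get(m, 0.0) < baseline[m] - tolerance
    }


def gate(results, baseline, tolerance=0.02):
    """Print each regressed metric and return a process exit code for CI."""
    dropped = regressions(pass_rates(results), baseline, tolerance)
    for metric, (old, new) in sorted(dropped.items()):
        print(f"REGRESSION {metric}: {old:.2f} -> {new:.2f}")
    return 1 if dropped else 0


# Assumed baseline; in practice this would be loaded from a committed file.
baseline = {"tool_selection": 0.95, "arguments": 0.90, "final_answer": 0.92}

# Four cases from the same fixed dataset, scored after a prompt or model change.
results = [
    {"tool_selection": True, "arguments": True, "final_answer": True},
    {"tool_selection": True, "arguments": False, "final_answer": True},
    {"tool_selection": True, "arguments": True, "final_answer": True},
    {"tool_selection": True, "arguments": True, "final_answer": True},
]

exit_code = gate(results, baseline)  # prints: REGRESSION arguments: 0.90 -> 0.75
# In a CI job, the script would end with: sys.exit(exit_code)
```

Because the gate reports per-metric drops instead of a single blended number, a failing build points at what regressed (here, argument correctness) while the other metrics stay green.
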