Arize demos hands-on agent evals
- Arize published a May 14 talk in which Laurie Voss argued teams should gate agent releases with task-based evaluations instead of benchmark scores alone.
- Laurie Voss showed one financial-analysis agent scoring 0 of 13 on correctness while scoring 13 of 13 on faithfulness in the same workshop.
- Arize is promoting related agent-evaluation materials on its docs site, blog and June 4 Arize:Observe event. (youtube.com)

Arize used a May 14 workshop to make a narrow point about agent testing: benchmark scores and a few manual spot checks are not enough to decide whether an agent is ready for production. In the video, Laurie Voss, Arize’s head of developer relations, walked through an evaluation pipeline for a financial-analysis agent and argued that teams should test agents against the tasks they are expected to complete in deployment. The workshop was published on YouTube under the title “Ship Real Agents: Hands-On Evals for Agentic Applications.” (youtube.com)

The talk fits a broader Arize push around agent evaluation. Arize’s website says its platform is built around development, observability and evaluation for AI applications and agents, and its recent documentation and blog posts describe production tracing, online and offline evals, and CI/CD workflows for agent systems.

### What was Voss arguing against?

Laurie Voss framed the target as what he called the “vibes problem”: testing an agent by running a handful of prompts and deciding it “looks right.” The YouTube description says that approach does not catch regressions, does not run in continuous integration, and does not show whether a prompt fix broke other workflows. (youtube.com)

Arize’s May 5 blog post makes the same case in longer form. In that post, company employees wrote that early testing for their Alyx agent relied on a Google Doc, manual checks and repeated reruns, and that small prompt or tool-description changes could create cascading failures that were hard to predict. (arize.com)

### What did the demo actually show?

The May 14 workshop used a financial-analysis agent as the running example. (youtube.com) According to the video description, Voss started with tracing in Phoenix, reviewed traces before writing evals, categorized failures by root cause, and then built code-based evals, built-in LLM-as-a-judge evals and a custom rubric with labeled examples. One example supplied the clearest contrast.
The same agent scored 0 out of 13 on a correctness eval and 13 out of 13 on a faithfulness eval, the description says, because the model “doesn’t know what year it is” and could not verify forward-looking financial data. (arize.com) That result was presented as evidence that the choice of eval can change what a team thinks it has measured. (youtube.com)

### Which kinds of agent failures was Arize focused on?

Arize’s agent-evaluation documentation says agents need to be evaluated not only on final answers but on “what it knows,” “the set of actions it can perform,” and “the pathway it took to get there.” The docs list templates for tool calling, tool selection, parameter extraction, path convergence, planning and reflection.

That framing maps to the failure modes Arize has been describing across its materials. (youtube.com) The company’s docs say agent test cases should cover missing context, short and long context, cases where no function should be called, cases with one or multiple function calls, and both single-turn and multi-turn pathways. Its May 5 blog post says real production traces become test cases when something breaks, so teams can rerun the exact conditions that caused a failure. (arize.com)

### Why does Arize keep emphasizing traces and datasets?

Arize’s blog says the “foundation” of its testing approach is capturing real production traces and turning them into golden datasets. The company says synthetic data tends to overrepresent happy-path examples, while production traces capture the failures and edge cases users actually generate. The workshop description pointed to the same sequence: trace first, inspect failures, then write evals. (arize.com)

Arize’s platform materials also describe evals, prompts and experiments as reusable development assets, with CI/CD experiments intended to catch prompt and agent regressions before release.

### Where does this leave teams looking for the next step?

Arize is directing users to several follow-on resources.
Its documentation includes agent-evaluation concepts and templates, while its May 5 blog post expands on testing agents in production with traces, golden datasets and CI/CD. (arize.com) The company’s homepage also says Arize:Observe is scheduled for June 4.

Laurie Voss remains one of the named presenters attached to Arize’s evaluation materials. (youtube.com) Arize’s courses page lists him as the instructor for “LLM Evaluation Basics,” and the May 14 workshop remains available on YouTube as the company’s most direct hands-on example of the argument that agent releases should be gated by task-based evals. (arize.com 1) (arize.com 2)
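
As a rough illustration of what gating a release on task-based evals could look like, the Python sketch below aggregates per-task eval results and blocks a release when any eval falls under a minimum pass rate. It is not Arize's or Phoenix's API. The names `EvalResult`, `pass_rates` and `release_gate`, the task IDs and the 0.9 thresholds are all hypothetical. The sample data only mirrors the 0-of-13 and 13-of-13 split reported from the workshop.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """One eval verdict for one task (hypothetical structure)."""
    task_id: str
    eval_name: str  # e.g. "correctness" or "faithfulness"
    passed: bool


def pass_rates(results):
    """Return the fraction of passing results for each eval name."""
    totals, passes = {}, {}
    for r in results:
        totals[r.eval_name] = totals.get(r.eval_name, 0) + 1
        passes[r.eval_name] = passes.get(r.eval_name, 0) + int(r.passed)
    return {name: passes[name] / totals[name] for name in totals}


def release_gate(results, thresholds):
    """Return (ok, failing_evals) given minimum pass rates per eval.

    An eval with no recorded results counts as a 0.0 pass rate,
    so a missing eval blocks the release rather than passing silently.
    """
    rates = pass_rates(results)
    failing = [name for name, minimum in thresholds.items()
               if rates.get(name, 0.0) < minimum]
    return (not failing, failing)


# Illustrative data: 13 tasks, every one failing correctness
# and passing faithfulness. Task IDs are made up.
results = []
for i in range(13):
    results.append(EvalResult(f"task-{i}", "correctness", False))
    results.append(EvalResult(f"task-{i}", "faithfulness", True))

# Assumed thresholds; a real team would choose its own.
ok, failing = release_gate(results, {"correctness": 0.9, "faithfulness": 0.9})
print(ok, failing)
```

Under these assumptions the gate blocks the release on correctness even though faithfulness is perfect. A single aggregate score would hide that split, which is the contrast the workshop example drew.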