Pydantic evals expose traceable regressions
- Pydantic’s evals tooling is getting real adoption as teams test LLM agents in Python, then inspect failures as traces in Logfire or other OpenTelemetry backends.
- The key trick is typed, code-first grading: datasets, cases, evaluators, and span-based checks can validate outputs, tool use, timing, and failures.
- That matters because agent bugs are often regressions, not crashes: behavior drifts across prompts, models, and tool chains.

Agent evals are turning into normal software engineering. That’s the real story here. Pydantic’s evals stack gives teams a way to test stochastic systems (LLM calls, agents, multi-step workflows) with typed Python code, then inspect exactly where a run went wrong through traces. The gap it fills is obvious once you’ve shipped anything agentic: the system still “works,” but a model upgrade, prompt tweak, or tool change quietly makes it worse. Now those regressions can show up in CI instead of in front of users. (pydantic.dev)

### What is Pydantic Evals, exactly?

It’s a Python library for evaluating non-deterministic functions. That includes plain LLM calls, but also more complex agent systems with tools and multi-step execution. The design is code-first: you define datasets, test cases, expected outputs, and evaluators in Python rather than in a separate web UI. That matters because it lets evals live next to the code they protect. (pydantic.dev)

### Why use Pydantic for this?

Because Pydantic is already about contracts. You define the shape of valid data, and the library checks whether reality matches that shape. In evals, that same instinct gets applied to agent behavior. A case has typed inputs. An output can be checked for structure, exact fields, or custom logic. Instead of “the chatbot seemed kind of off,” you get a failing assertion tied to a concrete case. (pydantic.dev)

### Why are regressions the big problem?

Agent systems usually don’t fail like ordinary software. They don’t always throw an exception and stop. More often, they drift. A new model chooses the wrong tool 8% more often. A prompt edit makes formatting less reliable. A retrieval change slows the workflow enough to break a timeout budget. Those are real bugs, but they hide inside otherwise plausible outputs. Evals make them measurable. Anthropic and OpenAI are both pushing the same broader pattern: use traces, graders, and datasets to catch behavior changes before production.
(anthropic.com)

### What makes these regressions traceable?

Tracing. Pydantic Evals records OpenTelemetry traces for each evaluation case, and those traces can be sent to Logfire or another compatible backend. That means a failed case is not just a red X on a dashboard. You can inspect the inputs, outputs, expected outputs, scores, execution duration, evaluator failures, and the underlying execution. Basically, you can see both that the agent failed and how it failed. (ai.pydantic.dev)

### Why is span-based evaluation a big deal?

Because final answers are only half the story. Sometimes the output looks fine, but the path was bad: wrong tool, unnecessary call, slow branch, policy violation, wasted tokens. Pydantic explicitly supports span-based evaluation for internal agent behavior, not just end outputs. OpenAI’s agent eval docs make the same point: an eval should also check whether the agent picked the right tool or followed the right handoff path. (pydantic.dev)

### Does this replace unit tests?

No, it extends them. Unit tests still own deterministic logic. Evals cover the fuzzy layer on top. The useful pattern is hybrid: deterministic assertions where you can have them, softer graders where you need them, and repeated datasets to compare runs over time. Anthropic frames this as matching the evaluation method to the system’s complexity. That’s basically the mature view. (anthropic.com)

### Why does this fit CI so well?

Because the whole setup is code-native. You can run experiments in Python, print reports in the terminal, serialize results, store them, and compare runs over time. Logfire adds a web view for failed cases and run comparison, but the core workflow starts in code. That makes evals feel less like a side project and more like a normal release gate. (pydantic.dev)

### What’s the bottom line?

The important shift is not “Pydantic has another feature.” It’s that agent reliability is getting a software-testing shape.
Typed cases, explicit evaluators, and trace-level debugging give teams a practical way to catch regressions that used to feel slippery and anecdotal. For AI teams trying to ship without surprises, that’s a big step. (pydantic.dev)
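
To make that shape concrete, here is a minimal, library-agnostic sketch of the pattern: typed cases, explicit evaluators, a recorded tool path, and a comparison between two runs. It deliberately uses only the Python standard library and is not the Pydantic Evals API. Every name in it (`EvalCase`, `RunRecord`, `run_evals`, `find_regressions`, and the two toy agents) is a hypothetical stand-in, and the `tool_calls` tuple is a simplified substitute for real trace spans.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One typed test case (hypothetical name, not a library class)."""
    name: str
    inputs: str
    expected_output: str
    expected_tool: str


@dataclass(frozen=True)
class RunRecord:
    """What one run produced; tool_calls is a stand-in for trace spans."""
    output: str
    tool_calls: tuple[str, ...]


Agent = Callable[[str], RunRecord]
Evaluator = Callable[[EvalCase, RunRecord], bool]


def output_matches(case: EvalCase, run: RunRecord) -> bool:
    # Checks the final answer only.
    return run.output == case.expected_output


def used_only_expected_tool(case: EvalCase, run: RunRecord) -> bool:
    # Checks the path: exactly one call, to the expected tool.
    return run.tool_calls == (case.expected_tool,)


def run_evals(
    agent: Agent, cases: list[EvalCase], evaluators: list[Evaluator]
) -> dict[str, bool]:
    """Run every evaluator on every case; keys look like 'case:evaluator'."""
    results: dict[str, bool] = {}
    for case in cases:
        run = agent(case.inputs)
        for evaluator in evaluators:
            results[f"{case.name}:{evaluator.__name__}"] = evaluator(case, run)
    return results


def find_regressions(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> list[str]:
    """Checks that passed in the baseline run but fail in the candidate run."""
    return sorted(
        key for key, passed in baseline.items()
        if passed and not candidate.get(key, False)
    )


# Toy stand-ins for two versions of an agent (assumptions for illustration).
def baseline_agent(question: str) -> RunRecord:
    if question == "2 + 2":
        return RunRecord("4", ("calculator",))
    return RunRecord("Paris", ("search",))


def candidate_agent(question: str) -> RunRecord:
    if question == "2 + 2":
        # Same answer as before, but with an unnecessary extra tool call.
        return RunRecord("4", ("search", "calculator"))
    return RunRecord("Paris", ("search",))


CASES = [
    EvalCase("capital", "capital of France", "Paris", "search"),
    EvalCase("arithmetic", "2 + 2", "4", "calculator"),
]
EVALUATORS = [output_matches, used_only_expected_tool]

baseline_results = run_evals(baseline_agent, CASES, EVALUATORS)
candidate_results = run_evals(candidate_agent, CASES, EVALUATORS)
regressed = find_regressions(baseline_results, candidate_results)
print(regressed)  # ['arithmetic:used_only_expected_tool']
```

The candidate agent still returns the right answer on every case, so an output-only check would pass it. The regression only shows up because the path is recorded and graded too. In a CI job, a non-empty `regressed` list could fail the build.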