NEO releases agent failure classifier
- NEO surfaced an open-source Agent Failure Classifier this week: a post-hoc debugging tool that reads agent traces and labels why runs went wrong.
- The package tags eight failure modes, pinpoints the first bad turn, and ships as both a Python library and CLI with HTML reports.
- That matters because agent teams still debug production failures by hand, even as trace volume and tool complexity keep rising.
Agent debugging is getting its own linting layer. That is basically what NEO just shipped with its Agent Failure Classifier, a small but useful tool for teams running tool-using LLM agents in production. Instead of staring at a long trace and guessing what broke, you feed the run into the classifier and it labels the failure, points to the first turn where things went sideways, and spits out a structured report. (github.com)

### What did NEO actually release?

NEO released an open-source Agent Failure Classifier as both a Python library and a command-line tool. The repo describes it as a post-hoc root-cause analysis tool for failed or low-quality LLM agent runs, and the demo material shows it taking in traces made of user turns, tool calls, model responses, and final output […] across many traces. (github.com)

### What problem is it trying to solve?

The annoying part of agent ops is not just that agents fail; it is that they fail in messy, overlapping ways. A bad answer might be a hallucination, but it might also come from using the wrong tool, losing track of prior steps, or drifting off the original goal halfway through a long chain. Human reviewers can […] real traffic and recurring incidents. The classifier is trying to turn that fuzzy postmortem work into something machine-readable. (github.com)

### Which failures does it recognize?

The tool uses exactly eight labels: hallucination, tool misuse, context loss, circular reasoning, goal drift, over-refusal, schema error, and timeout cascade. Those categories are concrete enough to be operational. “Hallucination” covers unsupported factual claims or calls to tools that do not exist. “Tool misuse” […] “Context loss” is repeated work or forgotten prior decisions. The rest cover loops, sub-goal obsession, unnecessary refusal, malformed structured output, and a slow tool call knocking the rest of the run off balance. (github.com)

### How does it decide?
The repo says it uses a hybrid setup: fast rule-based detectors first, with an optional LLM-as-judge pass through OpenRouter. That design choice matters. Pure LLM judging is flexible but expensive and inconsistent. Pure rules are cheap but brittle. NEO is splitting the difference: use deterministic checks for obvious failure signatures, then add a model pass when the trace needs interpretation. (github.com)

### Why is “first bad turn” useful?

Because most agent traces are long enough to hide the real mistake. The final wrong answer is often just the visible symptom. The useful moment is earlier: the first bogus tool call, the first malformed JSON blob, the first repeated step. If a classifier can reliably mark that point, teams can wire it into observability […] every failure like a fresh mystery. That is the practical angle here. (dev.to)

### Is this just a taxonomy exercise?

Not really. There is a broader push right now to formalize agent failure modes, including Microsoft’s recent taxonomy work on agentic AI safety and security. But NEO’s release is more tactical than academic. It is not trying to map the whole universe of agent risk. It is giving builders a tool they can run on traces today and plug into existing workflows. (cdn-dynmedia-1.microsoft.com)

### What is the catch?

The catch is coverage. Eight labels are useful, but real production failures are often mixed cases: bad retrieval causes hallucination, or latency triggers tool misuse and then goal drift. So this is not a complete explanation engine. It is a triage layer. Still, triage is exactly what many agent teams are missing. (github.com)

### Bottom line

NEO did not solve agent reliability. But it shipped something narrower and more immediately valuable: a way to turn messy traces into standardized failure labels. For teams already drowning in agent logs, that is a real step from “we know it broke” to “we know how it broke.” (github.com)
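
### What could a rules-first pass look like?

To make the "rules first, judge second" idea and the "first bad turn" idea concrete, here is a minimal Python sketch. It is not NEO's code or API. The trace shape (a list of dicts with `type`, `tool`, `args`, `content`, and `expects_json` keys), the names `FailureLabel`, `Finding`, `check_turn`, and `classify`, and the `judge` callback standing in for an LLM-as-judge pass are all assumptions made for illustration. Only the eight label names come from the article. Treating an identical repeated tool call as circular reasoning rather than context loss is also a judgment call.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class FailureLabel(Enum):
    # The eight labels named in the article.
    HALLUCINATION = "hallucination"
    TOOL_MISUSE = "tool_misuse"
    CONTEXT_LOSS = "context_loss"
    CIRCULAR_REASONING = "circular_reasoning"
    GOAL_DRIFT = "goal_drift"
    OVER_REFUSAL = "over_refusal"
    SCHEMA_ERROR = "schema_error"
    TIMEOUT_CASCADE = "timeout_cascade"


@dataclass
class Finding:
    label: FailureLabel
    turn_index: int  # index of the first turn where the failure shows up
    reason: str


# Hypothetical hook for an optional model-based pass over the whole trace.
Judge = Callable[[list], Optional[Finding]]


def check_turn(turn: dict, index: int, known_tools: set, seen_calls: set) -> Optional[Finding]:
    """Run cheap deterministic checks on a single turn."""
    if turn.get("type") == "tool_call":
        name = turn.get("tool")
        # A call to a tool that does not exist counts as hallucination.
        if name not in known_tools:
            return Finding(FailureLabel.HALLUCINATION, index, f"call to unknown tool {name!r}")
        # An identical repeated call is treated here as a loop (assumption).
        signature = (name, json.dumps(turn.get("args", {}), sort_keys=True))
        if signature in seen_calls:
            return Finding(FailureLabel.CIRCULAR_REASONING, index, f"repeated call to {name!r}")
        seen_calls.add(signature)
    if turn.get("type") == "model_output" and turn.get("expects_json"):
        try:
            json.loads(str(turn.get("content", "")))
        except json.JSONDecodeError:
            return Finding(FailureLabel.SCHEMA_ERROR, index, "output is not valid JSON")
    return None


def classify(trace: list, known_tools: set, judge: Optional[Judge] = None) -> Optional[Finding]:
    """Return the earliest rule hit, falling back to the judge if no rule fires."""
    seen_calls: set = set()
    for index, turn in enumerate(trace):
        finding = check_turn(turn, index, known_tools, seen_calls)
        if finding is not None:
            return finding  # the first flagged turn wins
    return judge(trace) if judge is not None else None


trace = [
    {"type": "user", "content": "Find flights to Oslo and return JSON."},
    {"type": "tool_call", "tool": "search_flights", "args": {"to": "OSL"}},
    {"type": "tool_call", "tool": "book_hotel", "args": {"city": "Oslo"}},
    {"type": "model_output", "expects_json": True, "content": "{not json"},
]
finding = classify(trace, known_tools={"search_flights"})
print(finding.label.value, finding.turn_index)  # hallucination 2
```

The trace above also ends in malformed JSON, but the sketch reports the unknown `book_hotel` call at turn 2 because it scans turns in order and stops at the first hit. That is the "first bad turn" idea in miniature: the broken final output is the symptom, the earlier call is the cause worth looking at.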