ICML workshop seeks agent failure submissions
- ICML 2026’s Failure Modes in Agentic AI workshop is soliciting papers on agent breakdowns, with submissions due May 9 UTC ahead of July’s Seoul event.
- The workshop explicitly asks for reproducible triggers, trace-level diagnostics, and verified fixes, including strong negative results, not just benchmark wins.
- That matters because agent evaluation is shifting from end-task scores toward observability, debugging, and repair inside long-horizon tool-using systems.

Agent research has a measurement problem. Models can look fine on a benchmark, then fall apart once they start looping through tools, memory, and multi-step plans in the real world. That gap is what this ICML 2026 workshop is trying to turn into a research agenda. The event, “Failure Modes in Agentic AI,” is now taking submissions, with the OpenReview deadline set for May 9, 2026 at 12:59 PM UTC, and the workshop itself scheduled for July 10 in Seoul. (openreview.net)

### What is this workshop actually about?

The workshop is called “Failure Modes in Agentic AI: Reproducible Triggers, Trace Diagnostics, and Verified Fixes.” That title is basically the whole thesis. Instead of treating agent failures as embarrassing edge cases, the organizers want to treat them as first-class research objects: things you can define, trigger, inspect, compare, and maybe fix. The listed organizers on the ICML page are Manling Li and Zihan Wang. (icml.cc)

### Why focus on “agentic” failures?

Because agents fail differently from chatbots. A single-turn model can hallucinate and be wrong in one shot. An agent can make a small mistake early, then compound it over ten more steps while touching tools, writing to memory, or acting on stale assumptions. The workshop description calls out error cascades, brittle tool use after interface changes, unstable memory read/write over time, and cases where optimization pushes behavior into rigid templates. (icml.cc)

### What kind of papers do they want?

They want four things:

- Operational definitions that pin down where a failure begins and which loop caused it.
- Minimal reproducible triggers: the smallest setup that reliably makes the system break.
- Comparable protocols with trace-level diagnostics, not just pass/fail outcomes.
- Mitigation and repair strategies that can actually be verified.

That last bucket explicitly includes strong negative results. (icml.cc)

### Why are negative results a big deal here?
Because a lot of agent work still rewards demos over diagnosis. If a method fails, that often disappears into an appendix, a private eval dashboard, or an internal postmortem. This workshop is signaling the opposite: if you learned that a mitigation does *not* work, or only works under narrow conditions, that is useful evidence. In effect, the organizers are trying to create a venue where that kind of finding counts as a contribution. (icml.cc)

### What does “trace diagnostics” mean in practice?

Think less final score, more flight recorder. The workshop wants evidence from inside the trajectory: which tool call failed, where memory drifted, when recovery logic kicked in, and whether the repair actually changed behavior. That is a shift away from judging agents only by terminal success. It lines up with a broader push in the field toward observability and failure attribution. One paper, for example, framed failure attribution as identifying which agent and which step caused a task to fail. (icml.cc)

### Why now?

Because ICML itself is leaning harder into agent reliability topics. The conference announced 44 workshops for 2026, and this one made the cut in a year with 247 workshop proposals. So this is not a random side event; it cleared a very competitive selection process. Meanwhile, the broader ecosystem is getting more concrete about agent failures, including taxonomies from major industry teams and more papers on collapse, attribution, and long-horizon brittleness. (blog.icml.cc)

### Who should care besides academics?

Anyone building agent products. If your team cares about monitoring, debugging, evals, or rollback logic, this is the language you need. Reproducible triggers and trace diagnostics are how you move from “the agent sometimes does weird stuff” to “this failure appears under these conditions, in this loop, and this mitigation reduces it.” That is a much more operational frame than the usual leaderboard mindset. (icml.cc)

### So what’s the bottom line?
The interesting part is not just that an ICML workshop wants submissions. It is that the workshop is trying to legitimize a different kind of result — failure cases, debugging traces, and fixes that can be checked. If that framing sticks, agent research may get a little less obsessed with polished wins and a lot more serious about how systems actually break. (icml.cc)
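
To make the "flight recorder" idea concrete, here is a minimal Python sketch of what trace-level evidence could look like, as opposed to a single pass/fail score. Everything in it is an assumption made for illustration: the `Step` record, the `first_failure` and `failure_rate` helpers, and the example trace are hypothetical and do not come from the workshop call or any particular library. Blaming the earliest step whose local check failed is also a deliberately naive stand-in for failure attribution, not the method of any specific paper.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical trace record: one entry per step in an agent trajectory.
@dataclass
class Step:
    index: int
    agent: str
    action: str      # e.g. a tool name or "memory_write"
    ok: bool         # did this step pass its own local check?
    note: str = ""


def first_failure(trace: list[Step]) -> Optional[Step]:
    """Return the earliest step whose local check failed, or None."""
    for step in trace:
        if not step.ok:
            return step
    return None


def failure_rate(runs: list[list[Step]]) -> float:
    """Fraction of runs that contain at least one failed step."""
    if not runs:
        return 0.0
    failed = sum(1 for run in runs if first_failure(run) is not None)
    return failed / len(runs)


# Made-up trajectory: an early tool error cascades into later steps.
trace = [
    Step(0, "planner", "plan", True),
    Step(1, "researcher", "search_tool", False, "tool schema changed"),
    Step(2, "researcher", "memory_write", True, "stored stale result"),
    Step(3, "writer", "draft", False, "built on stale memory"),
]

culprit = first_failure(trace)
print(culprit.agent, culprit.index, culprit.note)
# prints: researcher 1 tool schema changed
```

Running the same trigger with and without a mitigation and comparing `failure_rate` across the two sets of runs is one simple way to turn "this mitigation reduces it" into something checkable.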