Common agent failure modes & fixes
Experts are converging on a short list of failure modes (hallucinations, context drift, retry traps, tool brittleness, and cost spirals), and practical fixes such as checkpoint validation, structured memory, and circuit breakers keep reappearing in production playbooks. The same posts stress building eval loops and adding redundancy for brittle tools to prevent single-point breakdowns in multi-step agent chains.
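As a rough illustration of the circuit-breaker fix for retry traps, the sketch below stops calling a failing tool after a few consecutive errors and only allows a trial call once a cool-down has passed. It is a minimal sketch, not any particular library's API: the names `CircuitBreaker`, `CircuitOpenError`, `max_failures`, and `reset_after` are all hypothetical and chosen for this example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call (hypothetical name)."""


class CircuitBreaker:
    """Illustrative breaker: opens after `max_failures` consecutive errors."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable clock, handy for tests
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast instead of retrying the tool.
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: allow one trial call. A single further
            # failure re-opens the circuit immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

An agent loop would wrap each brittle tool or model call in `breaker.call(...)` and treat `CircuitOpenError` as a signal to degrade gracefully (skip the step, use a cached result, or escalate) rather than retry.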
Galileo published a structured seven-mode taxonomy on Nov 1, 2025 that maps failures across Memory→Reflection→Planning→Action and shows how early state corruption cascades downstream. (galileo.ai) Microsoft's AI Red Team released a taxonomy and an April 24, 2025 whitepaper that highlights memory-poisoning case studies and recommends semantic validation for persistent agent state. (microsoft.com)

A Feb 19, 2026 operational playbook from OneUptime prescribes per-step distributed traces and OpenTelemetry spans because a single user query can trigger as many as 15 LLM calls, creating hidden cost and latency variance. (oneuptime.com) Comparative write-ups from LangChain and Galileo list LangSmith, Langfuse, Helicone, and Datadog as leading observability/eval platforms and note OpenTelemetry as the emerging vendor-neutral standard for trace instrumentation. (langchain.com)

Production autopsies show real dollar impact: a LangGraph production loop consumed about $200 in API charges in a single incident, while field reports document smaller but frequent spikes such as $40/hour runaway retries during provider outages. (markaicode.com)

Expedia Group published new GenAI APIs and partnership launches during its May 14, 2025 EXPLORE event (including Trip Matching and integrations with Copilot/Operator), and its press materials claim the Reservation Management API can save hotels an estimated 8 million hours and $120M annually. (businesswire.com)

Recent engineering literature and packages are converging on guardrails: academic circuit-breaker techniques (Representation Rerouting) and an open AgentCircuit runtime decorator (released Feb 4, 2026) pair with checkpoint/resume memory patterns and multi-provider fallbacks as practical defenses against memory poisoning, infinite retries, and tool brittleness. (arxiv.org)
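One way to picture a multi-provider fallback that cannot turn into a runaway retry loop is to pair it with a hard per-query call budget. The sketch below is only an illustration under assumptions: `CallBudget`, `BudgetExceeded`, and `complete_with_fallback` are hypothetical names, and each provider is modeled as a plain Python callable rather than a real SDK client.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a query has used up its allowed model calls (hypothetical)."""


class CallBudget:
    """Hard cap on model calls for one user query."""

    def __init__(self, max_calls):
        self.max_calls = max_calls
        self.used = 0

    def spend(self):
        if self.used >= self.max_calls:
            raise BudgetExceeded(f"call budget of {self.max_calls} exhausted")
        self.used += 1


def complete_with_fallback(prompt, providers, budget):
    """Try each (name, callable) provider in order, charging the budget per attempt.

    `providers` is an assumed list of (name, fn) pairs where fn(prompt)
    returns text or raises on failure.
    """
    errors = []
    for name, fn in providers:
        budget.spend()  # counts failed attempts too, so outages cannot loop forever
        try:
            return name, fn(prompt)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

A budget of around 15 calls would match the per-query ceiling cited above; the right number for a given agent is a tuning decision. Because `BudgetExceeded` is raised outside the `try`, it propagates to the caller instead of being swallowed as just another provider error.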