Production gaps for LLM agents
Engineers are flagging infrastructure gaps that make LLM-based agents fragile in production, such as missing decision-path tracing and action-level cost monitoring, weak failure-recovery patterns, and insufficient test environments. Shan Han argues these operational holes are why prototypes hallucinate, retry endlessly, or blow their budgets, and he calls out tracing, cost-by-action metrics, and smarter retries as concrete fixes. (x.com)
A large language model agent is not one program doing one job. It is a loop that reads a request, chooses a next step, calls a tool, reads the result, and decides again until it stops. (openai.com) That loop is why demos look smart and production systems look weird. The same user request can trigger different tool paths on different runs, because natural language inputs are effectively unbounded and model behavior is not fully deterministic. (langchain.com)

Traditional monitoring was built for software with fixed routes, like a checkout page or an application programming interface endpoint. Agent systems break that model because one request can fan out into many model calls, tool calls, and handoffs before anyone sees the final answer. (langchain.com)

That is why engineers keep asking for tracing. A trace is an end-to-end record of one run that shows model calls, tool calls, guardrails, and handoffs in sequence, like a flight recorder for a cockpit. (openai.com) Without that flight recorder, a bad answer is hard to debug. You cannot tell whether the agent picked the wrong tool, followed the wrong instruction, or got stuck after a handoff unless you can replay the path it actually took. (openai.com)

Cost is the second blind spot. In agent systems, spending is not one bill per request, because each step has its own token count and tool cost, and one misrouted loop can multiply both in minutes. (grafana.com) That is why action-level cost tracking keeps coming up. Grafana’s agent observability guide says teams need token and cost data for each step so they can see which tool call or model choice is driving spend and reroute simple work to cheaper paths. (grafana.com)

The third gap is recovery. Many agent failures are not clean crashes with one stack trace; they are soft failures where the system retries, calls another tool, retries again, and never reaches a stable end state. (langchain.com)

A recent research survey of production agents found that teams already compensate by keeping systems simple and controllable. In 20 case studies and a survey of 306 practitioners across 26 domains, 68 percent of production agents took at most 10 steps before human intervention, and reliability was the top development challenge. (arxiv.org)

That is the backdrop for Shan Han’s point. The hard part is no longer getting an agent to do something once on a laptop; the hard part is building the missing plumbing so a real system can show its decisions, meter each action, stop bad loops, and survive contact with users. (x.com)

The practical fixes are not mysterious. Teams are converging on three boring-sounding pieces of infrastructure: traces for every run, metrics for every action, and retry rules that escalate or stop instead of letting the model keep guessing forever. (openai.com) (grafana.com) (arxiv.org)
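
To make those three pieces concrete, here is a minimal Python sketch of how they might fit together in one place. It is an illustration under stated assumptions, not the API of any tool or vendor cited above: the names `Step`, `Trace`, `StopRun`, and `run_step`, and the parameters `price_per_token`, `max_attempts`, and `budget_usd`, are all hypothetical. The sketch also assumes each action is a function that returns its result together with the tokens it used, and it treats failed attempts as costing nothing, which a real system would likely need to refine.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Step:
    """One recorded attempt at an action: a model call, tool call, or handoff."""
    kind: str
    name: str
    tokens: int
    cost_usd: float
    ok: bool
    duration_s: float


@dataclass
class Trace:
    """Ordered record of every attempt made during a single agent run."""
    run_id: str
    steps: list = field(default_factory=list)

    def total_cost(self):
        return sum(s.cost_usd for s in self.steps)

    def cost_by_action(self):
        """Sum spend per action name so the expensive steps stand out."""
        totals = {}
        for s in self.steps:
            totals[s.name] = totals.get(s.name, 0.0) + s.cost_usd
        return totals


class StopRun(Exception):
    """Signals that the run should stop and be escalated, not retried."""


def run_step(trace, kind, name, fn, *, price_per_token,
             max_attempts=3, budget_usd=1.0):
    """Run fn() with bounded retries, recording every attempt in the trace.

    Assumption: fn returns (result, tokens_used) and raises on failure.
    """
    for _ in range(max_attempts):
        # Check the budget before each attempt so a loop cannot overspend.
        if trace.total_cost() >= budget_usd:
            raise StopRun(f"budget of ${budget_usd} reached before {name}")
        start = time.perf_counter()
        try:
            result, tokens = fn()
            ok = True
        except Exception:
            # Simplification: a failed attempt is recorded with zero tokens.
            result, tokens, ok = None, 0, False
        trace.steps.append(Step(kind, name, tokens, tokens * price_per_token,
                                ok, time.perf_counter() - start))
        if ok:
            return result
    # Retries exhausted: stop instead of letting the agent keep guessing.
    raise StopRun(f"{name} failed {max_attempts} times; escalating")


if __name__ == "__main__":
    trace = Trace(run_id="demo")
    run_step(trace, "tool", "lookup_order",
             lambda: ({"status": "shipped"}, 120), price_per_token=0.00001)
    print(trace.cost_by_action())
```

The design choice worth noting is that every attempt, including each failed retry, lands in the trace, so the same record answers both "what path did the run take?" and "which action drove the spend?". The retry loop has two exits besides success, exhausted attempts and an exceeded budget, and both raise `StopRun` so the caller can hand the run to a human.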