Debugging non‑deterministic agents

Engineers are treating LLM agent failures like probabilistic bugs and recommending 'printf debugging' for every intermediate state so you can reproduce and fix flaky multi-model pipelines. Practitioners advise structured logging, token tracking, latency metrics, assertions between steps, and dashboards using tools like LangSmith or Helicone to make intermittent planner or tool-call errors visible (x.com) (x.com).

A normal software bug breaks the same way every time. An agent bug can fail on the 37th run, pass on the 38th, and then call the wrong tool on the 39th even though the code never changed. That is why engineers are starting to treat agent failures less like classic defects and more like probabilistic bugs. The fix is not just reading source code. The fix is capturing every intermediate state so the bad run can be replayed, compared, and understood.

Large language model agents are especially slippery because the “program” is spread across prompts, model outputs, tool calls, routing rules, memory, and external APIs. A single user request can turn into a planner step, several tool invocations, a retrieval pass, and a final synthesis, with each handoff creating another place for drift.

The result is a system that often looks healthy from the outside. The application returns a response with a successful status code, but one hidden step may have taken 12 seconds, burned thousands of tokens, or called a tool with malformed arguments.

That is the backdrop for a small but important shift in engineering practice this week. Practitioners on X described agent debugging in almost old-school terms: “printf debugging,” but for every stage of an agent pipeline, not just the final output. (x.com 1) (x.com 2)

The idea is simple. If an agent can think in steps, call tools in steps, and fail in steps, then developers need logs for each of those steps, with enough structure to answer basic questions later: What prompt was sent? Which model answered? What tool was selected? How long did it take? What came back?

That “structure” matters. Plain text logs are useful for a single request, but production agent systems generate trees of events, not neat linear sequences. LangSmith describes traces as the record of what an agent did and why, with runs capturing model calls, tool calls, and decision points inside that trace. (langchain.com) (docs.langchain.com)

Once those traces exist, teams can look for patterns instead of anecdotes. LangSmith says its platform tracks cost, latency, errors, and qualitative metrics in dashboards and alerts, which turns one strange run into a measurable failure mode across hundreds or thousands of runs. (langchain.com) (docs.langchain.com)

Token tracking has become part of that same discipline. LangSmith’s documentation breaks usage into input, output, and other token categories, then exposes those details inside traces and aggregate dashboards, so a team can see whether a prompt edit made an agent smarter or just more expensive. (docs.langchain.com)

Latency is another clue that old debugging habits missed. In an agent pipeline, a bad answer may begin as a slow answer: one retrieval step stalls, one model call retries, one tool hangs, and the final response arrives too late to be useful. LangSmith documents first-token and end-to-end performance metrics, while Helicone emphasizes request monitoring and alerting around affected traffic. (docs.langchain.com) (docs.helicone.ai)

Assertions between steps are becoming the guardrails around all of this. Instead of trusting the planner to produce valid tool arguments or trusting a sub-agent to return a usable schema, teams insert checks after each hop: Did the model choose an allowed tool? Did the payload match the schema? Did retrieval return any documents? Did the answer include the required fields?

That style fits the way newer agent frameworks are evolving. OpenAI’s Agents SDK says tracing is built in and can emit structured records of model calls, tool calls, handoffs, guardrails, and custom spans, which is exactly the data you need when a workflow fails only sometimes. (developers.openai.com 1) (developers.openai.com 2)

Helicone is aimed at a similar visibility problem from a different angle. Its product and documentation focus on routing, debugging, analyzing requests, grouping related calls into sessions, and sending alerts when patterns break, which helps when one user task fans out into many model and tool requests. (helicone.ai) (docs.helicone.ai 1) (docs.helicone.ai 2)

What changed is not that agents suddenly became flaky. They were flaky from the start. What changed is that more teams are now building multi-step systems in production, so intermittent failures are no longer curiosities. They show up as support tickets, cloud bills, and silent quality drops.

That is why “printf debugging” has resurfaced in a field full of dashboards and tracing systems. The phrase sounds primitive, but the instinct is modern: if you cannot see every intermediate state, you do not really know what your agent did.

The practical lesson is blunt. Treat every agent run like a distributed system trace, not a chatbot transcript. Log the prompt, the model, the tool choice, the tool arguments, the returned data, the token count, the latency, and the assertion result for each step. Once you do that, non-determinism stops looking like magic. It starts to look like a pile of ordinary engineering problems with timestamps, payloads, and failure rates attached.
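
To make that checklist concrete, here is a minimal sketch in plain Python of what one structured log record per step could look like, with a check applied between steps. It does not use the LangSmith, Helicone, or OpenAI tracing APIs. Every name in it (StepRecord, run_step, validate_tool_call, ALLOWED_TOOLS, fake_planner, the "example-model" label) is hypothetical and chosen for illustration, and the planner is a stub standing in for a real model call.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass
from typing import Any, Callable, Optional

# Hypothetical allow-list; a real system would derive it from its tool registry.
ALLOWED_TOOLS = {"search", "calculator"}


@dataclass
class StepRecord:
    """One structured log entry per agent step (field names are illustrative)."""
    trace_id: str
    step: str
    model: Optional[str]
    prompt: Optional[str]
    tool: Optional[str]
    tool_args: Any
    output: Any
    tokens: int
    latency_ms: float
    check_ok: bool
    check_error: Optional[str] = None


def validate_tool_call(result: dict) -> None:
    """Check between steps: the tool is allowed and its arguments are a dict."""
    if result.get("tool") not in ALLOWED_TOOLS:
        raise ValueError(f"tool not allowed: {result.get('tool')!r}")
    if not isinstance(result.get("tool_args"), dict):
        raise ValueError("tool arguments must be a dict")


def run_step(
    trace_id: str,
    step: str,
    fn: Callable[[], dict],
    *,
    model: Optional[str] = None,
    prompt: Optional[str] = None,
    check: Optional[Callable[[dict], None]] = None,
    sink: Callable[[str], Any] = print,
) -> StepRecord:
    """Run one step, time it, apply an optional check, and emit one JSON line."""
    start = time.perf_counter()
    result = fn()  # assumed to return a dict with tool, tool_args, output, tokens
    latency_ms = (time.perf_counter() - start) * 1000.0

    check_ok, check_error = True, None
    if check is not None:
        try:
            check(result)
        except ValueError as exc:
            # Record the failure instead of raising, so the bad run stays inspectable.
            check_ok, check_error = False, str(exc)

    record = StepRecord(
        trace_id=trace_id,
        step=step,
        model=model,
        prompt=prompt,
        tool=result.get("tool"),
        tool_args=result.get("tool_args"),
        output=result.get("output"),
        tokens=int(result.get("tokens", 0)),
        latency_ms=round(latency_ms, 3),
        check_ok=check_ok,
        check_error=check_error,
    )
    sink(json.dumps(asdict(record), default=str))
    return record


def fake_planner() -> dict:
    # Stub for a model call that picked a valid tool but malformed arguments.
    return {"tool": "search", "tool_args": "weather in Oslo", "output": None, "tokens": 412}


log_lines: list = []
record = run_step(
    uuid.uuid4().hex,
    "planner",
    fake_planner,
    model="example-model",
    prompt="What is the weather in Oslo?",
    check=validate_tool_call,
    sink=log_lines.append,
)
print(log_lines[0])
```

In this sketch the planner step "succeeds" in the sense that nothing crashes, but the emitted record carries check_ok set to false and the reason, which is the kind of hidden failure the article describes. Writing one JSON line per step is a design choice made here so that runs sharing a trace_id can be grouped and counted later; a tracing platform would typically handle that grouping for you.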
