Alex Smale wants agent trace replays

- Alex Smale used a May 3 blog post to argue that AI teams need trace-and-replay tooling, not just uptime dashboards, before autonomous agents touch real work.
- His checklist is concrete: capture user input, planning steps, memory reads and writes, tool routing, API responses, approvals, errors, latency, tokens, and cost.
- The idea matters because agent observability is fast becoming core infrastructure for debugging, governance, and regression testing across production AI stacks.

AI agent observability is becoming its own software category, and Alex Smale’s pitch is simple: if an agent can touch money, customers, or compliance, you need to be able to replay what it did. In a post published May 3, he argued that normal app monitoring is not enough for autonomous systems because it tells you whether a service stayed up, not why an agent picked a tool, ignored policy, or looped into a bad action. That gap is starting to matter now because more teams are moving from chatbot demos into tool-using agents that actually do work.

### What is he actually asking for?

He wants observability stacks to store a full execution record of an agent run: a step-by-step trail that shows what the agent saw, what it decided, which tools it called, what came back, and where the run went wrong. Smale frames the core package as traces, replays, and post-mortems. The point is not pretty dashboards. The point is being able to inspect one bad run like an incident review, then learn from it.

### Why aren’t logs enough?

Because agents fail in weirder ways than normal software. A web app bug is often deterministic: same input, same broken output. An agent can take different paths on different runs, call different tools, pull different context, and still return a fluent answer that is wrong. Plain logs usually show fragments. A trace shows the whole path. That is why newer observability tools tell you where the bad decision entered the chain.

### What has to be inside a useful trace?

Smale’s list is more detailed than the slogan. He says teams should record user input and attached context, planning steps or reasoning summaries, memory reads and writes, tool selection and routing choices, external API calls and responses, approvals, outputs, errors, latency, token usage, and cost per run. That is basically the minimum needed to answer the painful question after an incident: “what exactly happened here?”

### Why does replay matter so much?

Because replay turns a weird one-off failure into something engineers can inspect and test. If a refund agent chose the wrong tool or a research agent hallucinated a step, a replay lets a team walk through the sequence without waiting for the same failure to happen again in production. In practice, that means you can convert bad real-world traces into regression tests.

### Is this just one person’s hobbyhorse?

Not really. Smale’s post landed amid a broader shift across the AI tooling market. LangChain, Braintrust, and others are all making the same basic argument from different angles: traces are the raw material for improvement, and production agent systems need visibility into tool calls, memory, branching, and failure patterns. The interesting part is that Smale pushes the idea in business-risk language, not just developer-productivity language.

### What problem is he really pointing at?

Black-box automation. A lot of agent demos look impressive because they finish a task. But once the task touches refunds, compliance workflows, or customer operations, “it seemed to work” stops being enough. Teams need an audit trail. They need to know whether an agent invented a step, retried five times, or quietly handed a broken mess to a human. That is the real stakes argument underneath the post.

### So what changed today?

The new thing is not the abstract concept of tracing. It is that people like Smale are now treating replayable agent traces as table stakes for deployment, not a nice-to-have debugging extra. That is a sign the market is maturing. Agents are moving from toy workflows into systems that need post-mortems, access controls, cost attribution, and repeatable tests: the boring infrastructure every serious software stack eventually grows.

### Bottom line

Smale’s argument is straightforward —
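
For readers who want something concrete, here is a minimal Python sketch of the two ideas discussed above: a trace record that holds the kinds of fields on Smale's checklist, and a bad trace turned into a regression case. It is an illustration under stated assumptions, not Smale's design or any vendor's API. Every name in it (`Step`, `Trace`, `regression_case`, the `cancel_order` and `issue_refund` tools, both routers) is hypothetical, and the routers are keyword stubs standing in for a real agent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    """One recorded step of an agent run. All field names are hypothetical."""
    kind: str                     # e.g. "plan", "memory_read", "tool_call", "approval"
    name: str                     # tool name or memory key involved in the step
    request: Dict                 # what the agent sent
    response: Dict                # what came back, stored so review needs no live API
    error: Optional[str] = None
    latency_ms: float = 0.0
    tokens: int = 0
    cost_usd: float = 0.0


@dataclass
class Trace:
    """A full run: the user input, the ordered steps, and the final output."""
    run_id: str
    user_input: str
    steps: List[Step] = field(default_factory=list)
    output: str = ""

    def total_tokens(self) -> int:
        return sum(s.tokens for s in self.steps)

    def total_cost(self) -> float:
        return sum(s.cost_usd for s in self.steps)


def tool_calls(trace: Trace) -> List[str]:
    """Return the names of the tools called, in the order they were recorded."""
    return [s.name for s in trace.steps if s.kind == "tool_call"]


def regression_case(trace: Trace, expected_tools: List[str]) -> Dict:
    """Pair a recorded input with the tool sequence a post-mortem says was correct."""
    return {
        "user_input": trace.user_input,
        "observed": tool_calls(trace),
        "expected": expected_tools,
    }


def passes(case: Dict, router: Callable[[str], List[str]]) -> bool:
    """Re-run only the routing decision on the recorded input and compare."""
    return router(case["user_input"]) == case["expected"]


# A hypothetical bad run: the refund agent called the wrong tool.
bad_trace = Trace(
    run_id="run-001",
    user_input="Refund order 1234",
    steps=[
        Step("plan", "plan", {}, {"summary": "find order, then refund"},
             tokens=120, cost_usd=0.002),
        Step("tool_call", "cancel_order", {"order_id": "1234"},
             {"status": "cancelled"}, latency_ms=310.0, tokens=40, cost_usd=0.001),
    ],
    output="Your order has been cancelled.",
)


def buggy_router(user_input: str) -> List[str]:
    """Stub for the original behavior: always cancels."""
    return ["cancel_order"]


def fixed_router(user_input: str) -> List[str]:
    """Stub for the patched behavior: refunds when the request mentions a refund."""
    return ["issue_refund"] if "refund" in user_input.lower() else ["cancel_order"]


case = regression_case(bad_trace, expected_tools=["issue_refund"])
print(case["observed"])             # ['cancel_order']
print(passes(case, buggy_router))   # False
print(passes(case, fixed_router))   # True
```

The sketch only replays the routing decision. A fuller replay would also feed the recorded tool responses back to the agent step by step, which is why the `response` field is stored on every step.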
