Substack refines agent production playbook

- On May 24, 2026, an engineering thread described how Substack hardened growth automation agents for production by replacing demo-style checks with operational debugging systems.
- The clearest change was moving past a binary “did it run” test toward richer evals, structured logs, self-debugging loops, and action-by-action review.
- The thread is on X from Manish Kumar; Substack engineers are likely to keep refining evals, logs and operator review flows.

Substack’s latest engineering lesson on agents is not about a new model or a new workflow. It is about what broke when a promising automation system met production conditions. In a thread shared on May 24, Manish Kumar described how Substack’s growth automation agents exposed a gap between demo success and day-to-day reliability, with the team finding that simple checks on whether an agent completed a run were not enough to judge performance.

That account lines up with a broader pattern in agent engineering. Anthropic wrote in a February engineering post that agent evaluations are harder than standard model evals because agents use tools across many turns, change state in their environment and can fail in ways that compound over time. Google Cloud said in a March post that reliable agents need continuous evaluation built from production monitoring, automated scoring and human feedback rather than one-off “vibe checks.” (x.com)

### Why did a working demo stop being enough?

Substack’s thread said the first problem was evaluation scope. A demo can look successful if the agent finishes a task, but that leaves open whether it took the right steps, used the right tools, or produced output that would hold up under repeated production use. (anthropic.com)

Kumar said the team had to go beyond the basic question of whether the system ran. Anthropic described the same issue in more general terms, saying agent systems need evaluations that account for multi-step behavior and the possibility that a model may appear to complete a task while still violating the intended constraints of the test. (cloud.google.com)

### What did Substack add to catch those failures?

The thread said one remedy was richer evals. Instead of treating each run as pass or fail, the team added more detailed checks around the quality of intermediate behavior and outputs, according to Kumar’s account.
(x.com) Google Cloud’s guidance describes that approach as continuous evaluation, with teams measuring behavior in production and feeding those results back into prompts, tools and control logic. (anthropic.com)

Arize AI, in an April field report on production agents, said developers repeatedly identified the same challenge: shipping an agent is easy, but understanding whether it works in production requires tracing, evaluation and inspection of the steps inside a run. (x.com)

### How did the team use logs for self-debugging?

Kumar said Substack also used structured logs so the model could inspect what happened and help debug its own failures. (cloud.google.com) That shifts logs from a passive record into an input for remediation.

LangChain’s LangSmith team described a related production pattern in December, saying debugging deep agents depends on tooling that can surface the exact sequence of actions, tool calls and state changes that led to a failure. (arize.com) The common requirement is structure: logs have to be organized well enough for both humans and models to reason over them. (x.com)

### Why does an action-by-action UI matter?

The thread said Substack built a user interface that shows each agent action for human inspection. That gives operators a way to review what the system actually did, not just what final result it returned.

That kind of visibility has become a recurring demand in production agent tooling. (langchain.com) Arize said teams want human-debuggable traces rather than aggregate dashboards alone, while Google Cloud said human feedback remains part of the evaluation loop even when automated scoring is in place.

### What does this say about the production playbook?

Substack’s thread points to a narrower, more operational definition of agent readiness: richer evals, structured traces, model-assisted debugging and a review surface for humans.
(x.com) Kumar presented those as the systems that closed the gap between a demo and a production workflow.

The next place to watch is the engineering discussion itself. Kumar’s May 24 thread on X is the public record for these changes, and related work across Anthropic, Google Cloud, Arize and LangChain suggests the same set of tools (evals, traces and inspection UIs) is becoming standard in production agent operations. (arize.com) (x.com)
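
For readers who want a concrete picture, here is a minimal Python sketch of two of the ideas above: recording each agent action as a structured log entry, and scoring a run step by step rather than with a single “did it run” flag. It is an illustration under stated assumptions, not Substack’s implementation. The `Step` schema, the tool names and the `evaluate_run` checks are all hypothetical and do not come from Kumar’s thread.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Step:
    """One agent action as a structured log entry (hypothetical schema)."""
    index: int
    tool: str
    ok: bool
    detail: str


def to_jsonl(steps):
    """Serialize steps as JSON Lines so a human or a model can read them back."""
    return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in steps)


def evaluate_run(steps, allowed_tools):
    """Score a run step by step instead of asking only whether it finished."""
    completed = bool(steps) and steps[-1].ok
    bad_tools = [s.index for s in steps if s.tool not in allowed_tools]
    failed = [s.index for s in steps if not s.ok]
    return {
        "completed": completed,  # the binary "did it run" signal
        "disallowed_tool_steps": bad_tools,
        "failed_steps": failed,
        "passed": completed and not bad_tools and not failed,
    }


# Hypothetical run: it finishes, but one step failed and one used an unexpected tool.
steps = [
    Step(0, "fetch_subscribers", True, "loaded segment"),
    Step(1, "send_email", False, "rate limited"),
    Step(2, "delete_list", True, "unexpected tool call"),
    Step(3, "write_report", True, "report saved"),
]
report = evaluate_run(
    steps, allowed_tools={"fetch_subscribers", "send_email", "write_report"}
)
print(to_jsonl(steps))
print(report)
```

In this sketch the run counts as completed yet fails the step-level checks, which is the kind of gap between demo success and production reliability the thread describes. The same JSON Lines output could feed a review UI or be passed back to a model for debugging.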
