Tidianez details decision-context logging

- Tidianez said on May 21 that SafeRun’s design starts with capturing decision-time context, arguing ordinary traces and logs are not enough to diagnose agent failures.
- A SafeRun design note said teams need replayable records of tool arguments, retrieved context, policy versions and reasoning between calls to investigate incidents. (dev.to)
- The post and companion note are available on Tidianez’s X account and in SafeRun’s May 21 design write-up. (dev.to)

Tidianez’s latest SafeRun notes are about a familiar failure in production agents: the agent does something it should not do, and the team cannot reproduce why. In a May 21 post, he argued that ordinary logs are too thin for that job and said teams need to capture the full decision context before a tool or API call is made. A SafeRun design note published the same day makes the same case more explicitly, saying replay has to be built from the decision itself rather than reconstructed later from flat traces. (dev.to)

The core claim is narrow but practical. (dev.to) If an agent’s run only records that a tool was called, a team can see the outcome but not the exact state that produced it. SafeRun’s write-up says post-incident debugging breaks down when the model’s reasoning between tool calls, the failed call’s full arguments, the retrieved context and the agent’s plan are missing.

### What does Tidianez say simple logs miss?

The SafeRun note says standard observability tools can describe “what happened,” but that is different from reproducing the decision path that led to a bad action. (dev.to) It says replay requires “complete state” capture with enough fidelity to step through a run after the fact, including exact tool arguments, reasoning between calls, retrieved context at each decision point, the policy that evaluated each action and the decision returned.

That framing lines up with how other agent platforms describe their own control points. (dev.to) OpenAI’s Agents SDK says its tracing records LLM generations, tool calls, handoffs and guardrails, while LangChain’s human-in-the-loop middleware pauses tool execution when a proposed action matches a review policy. Those systems show the industry already treats tool calls and approvals as first-class runtime events; Tidianez’s argument is that incident response also needs the surrounding decision state preserved.

### Which pieces of “decision context” are supposed to be captured?
Tidianez’s post, as reflected in the SafeRun note, centers on recording the inputs around each action before execution. (dev.to) The design note lists the contents of a decision-time context snapshot, such as inputs, retrieved context, external state, policy version and evaluator model version, and says those snapshots should be captured synchronously and persisted asynchronously. The same write-up says replay also depends on versioning “every policy and every rule and every classifier” that participated in a decision. (openai.github.io)

That is the part aimed at rule creation after an incident: if a team knows which policy fired, what risk signal was returned and which approval path was required, it can write a narrower prevention rule instead of guessing from a partial trace.

### Why is the missing-trace problem so painful after an incident?

The SafeRun note describes the failure mode in operational terms. It says engineers often rerun a non-deterministic agent for hours or days trying to recreate one bad action because the original reasoning trace, retrieved context and plan were not stored. (dev.to) The note calls that “universal pain” and says Tidianez had heard versions of it from roughly 20 engineers shipping agents in production.

Anthropic has described a related operational problem from another angle. In a March engineering post about Claude Code permissions, the company said it keeps an internal incident log of agentic misbehavior, including cases involving remote git branches, authentication tokens and production database migrations. (dev.to) That example is different from SafeRun’s replay argument, but it shows why teams want detailed records around risky actions and approvals.

### How does this fit with approval systems and guardrails?

LangChain’s documentation says human review can approve, edit, reject or respond to a paused tool call, depending on policy. (dev.to) The system saves graph state so execution can stop and resume later.
OpenAI’s tracing documentation similarly describes guardrails and tool calls as part of the run record.

Tidianez’s contribution is to push one step earlier in the timeline. His argument is that approval outcomes and tool events are useful, but they still leave a gap if teams do not preserve the exact context that existed before the action was proposed. (anthropic.com) SafeRun’s product loop, according to the design note, is “Replay → Understand → Create Rule → Prevent,” with replay as the foundation for the rest.

### What comes next from SafeRun?

The May 21 SafeRun note says the company plans Python and TypeScript SDKs and describes a `@guard` decorator that wraps any tool call. (docs.langchain.com) The post frames that work as infrastructure for capturing decisions at the moment they are made, rather than reconstructing them from logs after an incident. (dev.to)
