Build an agent observability tool

A recent newsletter on coding agents lays out how agent loops, memory and tool calls work in practice and points to a gap: engineers need tools to trace runs, diffs and failures. A simple project that logs prompts, tool calls, file diffs, latency and retry counts (backed by Postgres and WebSockets) would demonstrate backend chops and product sense around AI tooling. (The Tokenizer Edition)

An agent is not one answer from one model call. It is a loop: read the request, decide what tool to use, run the tool, look at the result, and repeat until it can stop. (platform.claude.com) Claude Code’s own documentation spells that loop out step by step, and says a simple question can take one or two turns while a harder coding task can chain dozens of tool calls across many turns. (platform.claude.com)

That is why agent bugs feel slippery. The failure is often not in the final answer but in turn 7, when the agent read the wrong file, called the wrong command, or retried the same bad step three times. (platform.claude.com)

The tooling world is already moving toward traces for this reason. OpenAI’s Agents software development kit says tracing is built in by default and records model generations, tool calls, handoffs, guardrails, and custom events during a run. (openai.github.io) LangSmith does the same thing for LangChain agents. Its docs say traces capture the path from the first user input to the final response, including tool calls, model interactions, and decision points. (docs.langchain.com)

So the interesting build is not “make another coding agent.” The interesting build is the black box recorder for coding agents: a dashboard that shows the prompt, every tool call, every file edit, every retry, and the exact moment the run went sideways. (developers.openai.com)

A good first version only needs five moving parts. Store each run in PostgreSQL, stream updates over WebSockets, capture latency for every step, save diffs for every file write, and mark retries and failures as first-class events instead of burying them in logs. (openai.github.io) (developers.openai.com)

The product shape is simple to picture. One screen lists runs like a delivery tracker, and clicking a run opens a timeline where you can see “read auth.ts,” “ran npm test,” “edited auth.test.ts,” and “timed out after 12.4 seconds” in order. (platform.claude.com) (docs.langchain.com)

The killer feature is the diff view. When a coding agent touches 14 files and one hidden change breaks the build, engineers do not want a paragraph of explanation; they want the exact before-and-after patch tied to the tool call that made it. (platform.claude.com)

This kind of project also signals something employers now care about. Anthropic says the majority of its code is now written by Claude Code, which means the bottleneck is shifting from “can you call a model” to “can you supervise a system that writes, runs, and revises code on its own.” (anthropic.com)

A polished version could add alerts for loops that exceed 20 turns, heat maps for slow tools, and side-by-side comparisons of two prompts on the same repo. That turns a weekend tracer into the kind of infrastructure every team wants once agents move from demo to production. (openai.github.io) (docs.langchain.com)
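To make the recorder idea concrete, here is a minimal in-memory sketch in Python of three of the pieces described above: per-step latency, a diff saved for every file write, and retries and failures stored as first-class events. Everything in it is an assumption made for illustration: the names `Event`, `RunRecorder`, `record_tool_call`, `record_file_write` and `MAX_TURNS` are hypothetical, the rule for what counts as a retry (the same step repeated right after it failed) is one possible definition, and persistence to PostgreSQL and streaming over WebSockets are left out entirely.

```python
import difflib
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional

MAX_TURNS = 20  # assumed alert threshold, taken from the "exceed 20 turns" idea


@dataclass
class Event:
    """One timeline entry. Field names are hypothetical, not a real schema."""
    turn: int
    kind: str                       # "tool_call" or "file_write"
    name: str                       # e.g. "npm test" or "auth.test.ts"
    latency_ms: float
    status: str = "ok"              # "ok" or "failed"
    error: str = ""
    retry_of: Optional[int] = None  # turn of the failed step this one repeats
    diff: str = ""                  # unified diff, set for file writes only


@dataclass
class RunRecorder:
    run_id: str
    events: List[Event] = field(default_factory=list)

    def _append(self, kind: str, name: str, latency_ms: float,
                status: str = "ok", error: str = "", diff: str = "") -> Event:
        # Assumed retry rule: same kind and name as the previous step,
        # and that previous step failed.
        prev = self.events[-1] if self.events else None
        retry_of = None
        if prev and prev.status == "failed" and (prev.kind, prev.name) == (kind, name):
            retry_of = prev.turn
        event = Event(len(self.events) + 1, kind, name, latency_ms,
                      status, error, retry_of, diff)
        self.events.append(event)
        return event

    def record_tool_call(self, name: str, fn: Callable[[], object]) -> Event:
        # Time the call and turn an exception into a "failed" event
        # instead of letting it vanish into logs.
        start = time.perf_counter()
        status, error = "ok", ""
        try:
            fn()
        except Exception as exc:
            status, error = "failed", repr(exc)
        latency_ms = (time.perf_counter() - start) * 1000
        return self._append("tool_call", name, latency_ms, status, error)

    def record_file_write(self, path: str, before: str, after: str) -> Event:
        # Store the before-and-after patch alongside the step that made it.
        # Latency is not measured here; 0.0 is a placeholder.
        diff = "".join(difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        ))
        return self._append("file_write", path, 0.0, diff=diff)

    def retry_count(self) -> int:
        return sum(1 for e in self.events if e.retry_of is not None)

    def exceeds_turn_limit(self) -> bool:
        return len(self.events) > MAX_TURNS
```

A caller would wrap each tool invocation, for example `rec.record_tool_call("npm test", run_tests)` where `run_tests` is whatever function actually runs the command, and call `rec.record_file_write(path, old_text, new_text)` around each edit. In a fuller version each `Event` would presumably become one database row and one WebSocket message, which is what lets the timeline update live.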
