Open-sourced long agent traces

- A team published production agent traces showing workflows with more than 20 tool turns and roughly 10,000-token prompts. (x.com)
- The traces were released as realistic benchmarks and debugging examples for complex multi-tool agent behavior. (x.com)
- Authors said these logs help reproduce hallucination loops and multi-tool failures that simple tests miss. (x.com) (x.com)

A team has released real production-style AI agent traces, giving outsiders a look at long tool-using workflows that usually stay inside companies. (x.com) The authors said some runs stretch past 20 tool calls and use prompts around 10,000 tokens, a scale closer to deployed agents than to one-shot chatbot demos. The release was framed as both a benchmark and a debugging set for multi-tool systems. (x.com 1) (x.com 2)

An agent trace is the step-by-step record of what an AI system saw, decided, and did: model calls, tool invocations, handoffs, retries, and outputs. OpenAI’s Agents SDK and Microsoft Foundry both describe tracing as the core way to inspect agent behavior in development and production. (openai.github.io) (github.com)

The push for longer traces comes from a simple problem: agents fail across many steps, not just in a final answer. Anthropic wrote in January 2026 that multi-turn agent evaluations are harder than single-turn tests because tools, state changes, and intermediate mistakes can compound over time. (anthropic.com)

Researchers have been building around that gap for the past year. A Carnegie Mellon University and Microsoft Research paper from March 2025 said developers struggle to review long agent conversations, localize errors, and reset workflows to test fixes. (arxiv.org)

Recent benchmarks have moved in the same direction. Patronus AI’s TRAIL dataset, published in 2026, includes 148 annotated agent traces with 841 total errors, average inputs above 200,000 tokens, and top-model joint accuracy of 11% or less on its debugging task. (patronus.ai) (huggingface.co)

The new release is narrower than those ultra-long corpora but closer to the traces engineers inspect when a live agent gets stuck in a loop. The authors said the logs can reproduce hallucination spirals and multi-tool breakdowns that short benchmark prompts often miss. (x.com 1) (x.com 2)

That also puts pressure on a persistent tradeoff in agent observability: the more detail a trace records, the more likely it is to capture sensitive prompts, user inputs, or tool results. Microsoft’s tracing guidance warns that full content recording is useful for debugging but should be disabled in production when privacy risks outweigh the benefit. (github.com)

For teams building agents, the practical value is less about leaderboard scores than replay. A trace lets a developer inspect the exact step where an agent chose the wrong tool, lost context, or kept searching after the task was already solved. (openai.github.io) (anthropic.com)

The release lands as agent builders are shifting from prompt demos to workflow debugging. Publishing the traces does not fix those failures, but it gives the field more of the raw material needed to find them. (x.com) (patronus.ai)
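
As a rough illustration of what a trace record and a replay check might look like, the Python sketch below models a trace as a list of steps and flags a tool call that repeats with identical arguments, one simple signal of an agent stuck in a loop. The `TraceStep` schema, the `find_repeated_tool_call` helper, and the sample data are hypothetical; they are not drawn from the released traces or from any vendor's SDK.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TraceStep:
    """One recorded event in an agent run (hypothetical schema)."""
    kind: str                      # "model_call", "tool_call", "handoff", "retry", "output"
    name: str                      # model or tool name
    arguments: dict = field(default_factory=dict)
    result: Optional[str] = None


def find_repeated_tool_call(steps, min_repeats=3):
    """Return the index of the first tool call in a run of at least
    `min_repeats` identical tool calls, or None if there is no such run.

    Non-tool steps (for example the model calls between tool calls) are
    skipped and do not break a run.
    """
    prev_key = None
    run_start = None
    run_len = 0
    for i, step in enumerate(steps):
        if step.kind != "tool_call":
            continue
        # Serialize arguments with sorted keys so equal dicts compare equal.
        key = (step.name, json.dumps(step.arguments, sort_keys=True))
        if key == prev_key:
            run_len += 1
        else:
            prev_key, run_start, run_len = key, i, 1
        if run_len >= min_repeats:
            return run_start
    return None


# Hypothetical trace: the agent issues the same search three times.
trace = [
    TraceStep("model_call", "planner"),
    TraceStep("tool_call", "search", {"q": "refund policy"}),
    TraceStep("model_call", "planner"),
    TraceStep("tool_call", "search", {"q": "refund policy"}),
    TraceStep("model_call", "planner"),
    TraceStep("tool_call", "search", {"q": "refund policy"}),
    TraceStep("output", "final_answer", result="done"),
]

print(find_repeated_tool_call(trace))  # 1
```

Exact-match repetition is a deliberately narrow check. Real loops often vary their arguments slightly, so a production detector would likely need fuzzier comparison than this sketch uses.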
