Agent evals and memory labels emerging

Thinking of AI like a file system, where agents keep and retrieve persistent memory, changes what needs to be labeled: not just single responses but retrieval correctness, memory writes, state consistency and multi-step trajectory audits. That shift turns evaluation work into structured, longitudinal tasks that require new schemas, adjudication layers and provenance so labs can judge whether an agent's memory and tool actions remain reliable over time. (youtube.com) (platform.claude.com)

A chatbot that forgets everything after each reply can be graded like a spelling test. An agent that keeps notes, opens tools, edits files, and comes back tomorrow has to be graded more like an audit log. (anthropic.com)

That shift is showing up in product design. OpenAI’s Agents SDK says agents “plan, call tools, collaborate across specialists, and keep enough state to complete multi-step work,” which means the thing being tested is no longer just one answer. (openai.com)

Memory is the new moving part. OpenAI’s January 5, 2026 cookbook on long-term memory describes a pattern where an agent stores structured profile data and notes across runs, then injects only the relevant slice back into the next run. (openai.com) Anthropic is pushing the same idea from the infrastructure side. In its April 8, 2026 post on Managed Agents, it describes a “session” as an append-only log of everything that happened, alongside a harness that routes tool calls and a sandbox where the model can run code and edit files. (anthropic.com)

Once an agent has a session log and persistent notes, a new kind of labeling job appears. Someone has to decide whether the agent retrieved the right memory, wrote down the right fact, kept the state consistent after a tool call, and stayed coherent across the whole trajectory. (openai.com)

That is why evaluation is moving from answer checking to trace checking. OpenAI’s agent evals guide says a trace captures the full record of model calls, tool calls, guardrails, and handoffs for one run, and graders can score that workflow with structured criteria. (openai.com) Anthropic’s January 9, 2026 evals post makes the same point in plainer terms: single-turn tests were built for prompt-in, response-out systems, while agents operate over many turns, modify state in the environment, and let mistakes compound. (anthropic.com)

That compounding changes what “correct” even means.
An agent can produce a polished final answer after using the wrong tool, pulling the wrong memory, or saving a bad note that will poison the next session. (anthropic.com) (openai.com)

It also creates a provenance problem. If a future run uses a remembered preference like “window seat” or a saved project rule like “deploy to staging first,” teams need to know when that memory was written, what evidence supported it, and whether a later run contradicted it. (openai.com)

Anthropic’s context engineering work frames this as managing a limited context window from a constantly changing universe of information. In practice, that means the label is not just “good response” but “good selection of what to carry forward and what to leave out.” (anthropic.com)

So a new layer of evaluation work is emerging around schemas, adjudication, and longitudinal review. The labs building memory-heavy agents now need labels for retrieval quality, memory-write quality, tool-use quality, and session-to-session consistency, because the failure may not show up until the fifth step instead of the first. (anthropic.com) (openai.com)
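
To make the labeling idea concrete, here is a minimal Python sketch of what such a schema might look like. It is an illustration only: the class and field names (MemoryRecord, StepLabel, first_failing_step and the rest) are hypothetical and are not taken from OpenAI's or Anthropic's tooling. It assumes each memory carries its own provenance (when it was written, which step supported it, whether a later run contradicted it) and that each step of a trajectory gets one label per dimension named above.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical schema for illustration; not any vendor's actual API.


@dataclass
class MemoryRecord:
    key: str                  # e.g. "seat_preference"
    value: str                # e.g. "window seat"
    written_at: datetime      # when the memory was written
    evidence_step: int        # trajectory step that supported the write
    contradicted_at: Optional[datetime] = None  # set if a later run disagreed

    def is_stale(self) -> bool:
        """A memory is treated as stale once any later run contradicted it."""
        return self.contradicted_at is not None


@dataclass
class StepLabel:
    step: int
    retrieval_ok: bool = True      # did the agent pull the right memory?
    memory_write_ok: bool = True   # did it save the right fact?
    tool_use_ok: bool = True       # did it call the right tool correctly?
    state_consistent: bool = True  # is state coherent after this step?

    def passed(self) -> bool:
        return all(
            (self.retrieval_ok, self.memory_write_ok,
             self.tool_use_ok, self.state_consistent)
        )


def first_failing_step(labels: list[StepLabel]) -> Optional[int]:
    """Return the earliest step with any failed dimension, or None."""
    for label in sorted(labels, key=lambda item: item.step):
        if not label.passed():
            return label.step
    return None


# A five-step trajectory where only the last step saved a bad note.
labels = [StepLabel(step=i) for i in range(1, 6)]
labels[4].memory_write_ok = False
print(first_failing_step(labels))  # 5

record = MemoryRecord(
    key="seat_preference",
    value="window seat",
    written_at=datetime(2026, 1, 5, tzinfo=timezone.utc),
    evidence_step=2,
)
print(record.is_stale())  # False
```

Labeling per step rather than per run is the design choice that matters here: a trajectory with a clean final answer can still be flagged at the exact step where a bad memory write occurred.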

