Microsoft releases STATE-Bench to evaluate AI agent memory and long-term state
- Microsoft on May 19 released STATE-Bench, an open-source benchmark designed to test whether AI agents improve over time on enterprise-style tasks. (opensource.microsoft.com)
- The GitHub repository says STATE-Bench is “memory-agnostic” and measures whether agents can learn from prior trajectories and handle realistic workflows. (opensource.microsoft.com)
- Microsoft published the benchmark on its Open Source Blog and released code, datasets and run instructions in the public GitHub repository. (opensource.microsoft.com)

Microsoft has put a number on a problem that agent developers have mostly described qualitatively: whether an AI agent actually gets better at a task after doing similar work before. On May 19, the company released STATE-Bench, short for Stateful Task Agent Evaluation Benchmark, as an open-source benchmark aimed at measuring agent memory and long-term state across enterprise-style tasks. (opensource.microsoft.com)

Microsoft said the project is “memory-agnostic,” meaning it is meant to test outcomes rather than prescribe a single memory architecture. That matters because “memory” in agent systems has become a catch-all term for several different capabilities: retaining context inside a session, storing facts across sessions, reusing prior trajectories, and avoiding repeated mistakes. (opensource.microsoft.com)

Microsoft’s framing is narrower and more operational. In its announcement, the company said STATE-Bench measures whether agents “improve with experience on realistic enterprise tasks,” not whether they can merely retrieve information from a long context window.

### What, exactly, is Microsoft testing here?

STATE-Bench is built around interactive domain scenarios that resemble enterprise workflows rather than static question-answer tests. The GitHub repository says each domain exposes a fixed set of tools and policies, and each task begins in a just-in-time sandbox populated with task-specific users and artifacts such as flight bookings, customer orders, carts and product records. (opensource.microsoft.com)

Microsoft’s blog says the target failure modes are the ones developers see in production-like agent runs: skipping policy checks, mishandling incomplete user details, using tools incorrectly or inefficiently, and repeating the same mistake on the next attempt. The benchmark is meant to show whether memory helps agents correct those patterns over time. (opensource.microsoft.com)

### Why is “memory-agnostic” a notable design choice?

Microsoft described STATE-Bench as “memory-agnostic,” which means the benchmark is not tied to one storage layer, one framework or one retrieval method. That lets developers compare different approaches to state management using the same tasks and scoring setup, instead of evaluating only agents built around a single vendor stack. (github.com)

Microsoft has been building a broader agent stack that includes agent sessions, context providers and persistence features in its Agent Framework documentation. In that context, STATE-Bench looks less like a standalone research artifact and more like part of a larger push to make state and memory testable pieces of infrastructure. That is an inference based on Microsoft’s documentation and launch materials. (opensource.microsoft.com)

### How is this different from older memory benchmarks?

Other public projects have tried to benchmark agent memory, but many focused on chatbot-style recall or long-context retrieval. Vectorize’s agent-memory benchmark, for example, argues that older datasets can be flattered by very large context windows because retrieval is easier than it used to be. Microsoft’s benchmark instead centers on tool use, policies and repeated enterprise tasks inside controlled environments. (opensource.microsoft.com)

That distinction is important because an agent that can quote prior conversation history is not necessarily an agent that can handle a refund policy correctly on the second try or avoid rebooking the wrong flight. Microsoft’s examples put the emphasis on workflow behavior, not just recall. (learn.microsoft.com)

### Where does this fit in Microsoft’s broader agent push?

Microsoft’s recent developer materials have emphasized production-grade agents, including state management, memory and observability.
The company’s Agent Framework documentation refers to agent sessions for state management and context providers for memory, while a recent Foundry blog post highlighted new memory capabilities and tracing tools. (github.com)

STATE-Bench fits that pattern by giving developers a way to test whether those systems actually improve agent behavior over repeated tasks. Microsoft said the benchmark is available now through its Open Source Blog and the public GitHub repository, which includes datasets, setup files and a guide for running evaluations with OpenAI and Azure OpenAI models. (learn.microsoft.com) (opensource.microsoft.com)