Microsoft introduces STATE‑Bench benchmark
- Microsoft said on May 19 it released STATE-Bench, an open-source benchmark for testing whether memory improves AI agents on realistic enterprise tasks. (opensource.microsoft.com)
- The first release covers 450 tasks across customer support, travel and shopping, with 300 public train trajectories and 150 public test tasks. (opensource.microsoft.com)
- Microsoft published the benchmark on its Open Source Blog and GitHub, where developers can run evaluations or plug in custom memory systems. (opensource.microsoft.com)

Microsoft on May 19 released STATE-Bench, a new benchmark meant to test whether memory actually improves AI agents on enterprise-style work rather than simply helping them retrieve facts from long conversations. The company described the project as open-source and “memory-agnostic,” meaning developers can use it to evaluate different memory approaches instead of a single Microsoft stack. (opensource.microsoft.com)

Microsoft said the benchmark is aimed at agent developers, researchers and platform teams. The release arrives as Microsoft has been pushing agents and open-source tooling more broadly around Build 2026. In its launch post, the company said existing memory tests often show only that a retrieval pipeline works, not that an agent performs procedures better in production-style settings. (opensource.microsoft.com)

### What problem is Microsoft trying to measure?

Microsoft said the gap is most visible in enterprise workflows, where failures often come from procedure rather than recall. In the company’s description, a customer-support agent may fail by skipping policy checks, surfacing incomplete user details, using tools incorrectly or repeating the same mistake. (opensource.microsoft.com)

Lewis Liu and Nishant Yadav, writing in Microsoft’s Open Source Blog, said that kind of failure is why the company built STATE-Bench. The benchmark is designed to measure whether agents “improve with experience” on realistic tasks, not just whether they can fetch a fact from earlier context. (opensource.microsoft.com)

### What does the benchmark actually include?

The public release includes three domains (customer support, travel and shopping) with 450 tasks in total, according to Microsoft’s blog post and GitHub repository. GitHub documentation says the release includes 300 train task trajectories for memory extraction and 150 test task definitions with locked evaluation environments.
(opensource.microsoft.com)

Each task starts in what Microsoft calls a just-in-time sandbox with task-specific users and artifacts such as flight bookings, customer orders, carts and product records. The company said each domain exposes a fixed set of tools and policies, and each task is built as an interactive scenario an enterprise agent is likely to encounter. (opensource.microsoft.com)

### Why does Microsoft call it “stateful”?

Microsoft said the benchmark focuses on tasks that change system state in a database, including refund records, booking status and account updates. In the company’s framing, those are not just conversational mistakes; they can create operational costs and cleanup work. The benchmark’s tasks also require multi-step procedures. (opensource.microsoft.com)

Microsoft said agents may need to look up a booking, validate eligibility, check policy, calculate fees, confirm an action and then execute it, with a wrong or skipped step affecting the outcome. (github.com)

### How is STATE-Bench evaluated?

Microsoft said each task has deterministic state assertions that define success. The orchestrator runs a multi-turn conversation loop in which the agent receives the conversation history and responds with tool calls and text. The company also said the benchmark measures user experience in addition to task success. (opensource.microsoft.com)

Its blog post says Microsoft created a rubric with strict guidance for what it considers a user-centric interaction, alongside the state-based task evaluation.

### Where can developers find it next?

Microsoft published STATE-Bench on May 19 through its Open Source Blog and a public GitHub repository under the microsoft organization. (opensource.microsoft.com)

The repository includes benchmark code, datasets, documentation for running the benchmark and guidance for using a custom client, according to the project page.
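
For readers who want a concrete picture of the evaluation idea described in this article, here is a minimal Python sketch of a multi-turn loop followed by deterministic state assertions. It is an illustrative toy under stated assumptions, not STATE-Bench's code: every name in it (`Turn`, `TOOLS`, `scripted_agent`, `run_task`, `ASSERTIONS`, `passed`) and the booking data are hypothetical and do not come from the Microsoft repository.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One agent response: text plus zero or more (tool_name, kwargs) calls."""
    text: str
    tool_calls: list = field(default_factory=list)


# Hypothetical tools that mutate an in-memory stand-in for a sandbox database.
def cancel_booking(state, booking_id):
    state["bookings"][booking_id]["status"] = "cancelled"


def issue_refund(state, booking_id, amount):
    state["refunds"].append({"booking_id": booking_id, "amount": amount})


TOOLS = {"cancel_booking": cancel_booking, "issue_refund": issue_refund}


def scripted_agent(history):
    """Toy agent: asks for confirmation first, acts only once the user agrees."""
    if history[-1] == "yes, please cancel":
        calls = [
            ("cancel_booking", {"booking_id": "B1"}),
            ("issue_refund", {"booking_id": "B1", "amount": 80}),
        ]
        return Turn("Done.", calls)
    return Turn("Cancelling B1 refunds 80 after a 20 fee. Confirm?")


def run_task(agent, user_messages, state):
    """Multi-turn loop: the agent sees the history and replies with text and tool calls."""
    history = []
    for message in user_messages:
        history.append(message)
        turn = agent(history)
        for name, kwargs in turn.tool_calls:
            TOOLS[name](state, **kwargs)  # tool calls change the state in place
        history.append(turn.text)
    return state


# Deterministic assertions on the final state define success for this toy task.
ASSERTIONS = [
    lambda s: s["bookings"]["B1"]["status"] == "cancelled",
    lambda s: s["refunds"] == [{"booking_id": "B1", "amount": 80}],
]


def passed(state):
    return all(check(state) for check in ASSERTIONS)


if __name__ == "__main__":
    state = {"bookings": {"B1": {"status": "confirmed"}}, "refunds": []}
    run_task(scripted_agent, ["cancel booking B1", "yes, please cancel"], state)
    print("task passed:", passed(state))
```

The point of the sketch is that success is judged from the final records rather than from the wording of the conversation, so a skipped step, such as a missing refund, fails the task even if the reply sounds correct.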