LongMemEval-V2 benchmark released
- Di Wu and co-authors released LongMemEval-V2 on May 13, 2026, publishing a benchmark, codebase and datasets for testing long-term memory in LLM agents.
- The benchmark spans 451 manually curated questions and histories as large as 115 million tokens, with AgentRunbook-C reaching 72.5% average accuracy.
- GitHub hosts the evaluation harness, baseline memory modules and leaderboard tooling, while the preprint is posted on arXiv. (arxiv.org)

Di Wu and six co-authors posted LongMemEval-V2 this week as a new benchmark for testing whether large language model agents can retain and use long-term experience in specialized web environments. The preprint was submitted to arXiv on May 12, 2026, and the project site and GitHub repository were live by May 14. The benchmark is framed around a practical question: whether a memory system can turn long histories of web-agent interaction into evidence that helps answer later questions about a specific environment. (arxiv.org)

The release includes evaluation code, data preparation tools, baseline memory systems and leaderboard packaging utilities.

### What exactly did the researchers release?

LongMemEval-V2 is described by the authors as a benchmark for “evaluating long-term agent memory toward experienced colleagues,” with the goal of measuring whether agents can accumulate environment-specific know-how over time. The authors are Di Wu, Zixiang Ji, Asmi Kawatkar, Bryan Kwan, Jia-Chen Gu, Nanyun Peng and Kai-Wei Chang. The official repository says it contains the public evaluation harness, data preparation tools, leaderboard utilities and the baselines reported in the paper. (arxiv.org)

The GitHub repository and project website both say the benchmark pairs manually curated questions with long histories of multimodal web-agent trajectories. In the benchmark setup, a memory system reads trajectory history and returns compact evidence for downstream question answering, and the reported outcomes include both answer accuracy and query latency.

### How large is the benchmark?

The arXiv abstract says LongMemEval-V2 contains 451 manually curated questions across five memory abilities and pairs them with interaction histories containing up to 500 trajectories and 115 million tokens. (arxiv.org) The project site repeats those scale figures and presents the benchmark as targeting sparse retrieval over very large “haystacks” of prior agent experience.

The repository says the public leaderboard is split into two tiers, labeled small and medium, and that the benchmark covers two domains: web and enterprise. (github.com) The same materials say the five tested abilities are static state recall, dynamic state tracking, workflow knowledge, environment gotchas and premise awareness.

### What kinds of memory failures is it trying to measure?

The authors say existing memory benchmarks often focus on user histories, short traces or end-task success rather than directly testing whether an agent has internalized environment-specific experience. (arxiv.org) LongMemEval-V2 is built around questions that ask whether an agent remembers page layouts, tracks state changes over time, recalls recurring workflows, avoids local failure modes and notices assumptions that hold elsewhere but not in the current deployment. (github.com)

The project site describes that target as the behavior of an “experienced colleague” in a customized environment. That wording is the authors’ framing for agents that can recall interface affordances, workflows and recurring errors from prior interactions rather than relying only on the immediate prompt.

### Which systems are reported as baselines?

The arXiv abstract reports two named memory methods from the paper: AgentRunbook-R, a retrieval-augmented approach with separate knowledge pools for observations, events and strategy notes, and AgentRunbook-C, which stores trajectories as files and uses a coding agent to gather evidence in an augmented sandbox. (arxiv.org) The abstract says AgentRunbook-C achieved 72.5% average accuracy, ahead of the strongest RAG baseline at 48.5% and an off-the-shelf coding-agent baseline at 69.3%.

The same abstract says coding-agent methods carry high latency costs, while the repository states that latency is a core evaluation target alongside accuracy. That means the benchmark is not only scoring whether a system finds the right evidence, but also how quickly it can do so.

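To illustrate what scoring a memory system on both accuracy and latency could look like, here is a minimal Python sketch. It is not taken from the LongMemEval-V2 harness: the names `Question`, `Report`, `evaluate`, `keyword_memory` and `echo_reader` are hypothetical, the two-line history is invented toy data, substring matching stands in for whatever scoring the benchmark actually uses, and timing only the memory lookup is an assumption about what "query latency" covers.

```python
import re
import time
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Sequence


@dataclass
class Question:
    text: str
    answer: str  # gold answer used for scoring


@dataclass
class Report:
    accuracy: float
    mean_latency_s: float


def tokens(text: str) -> set:
    """Lowercased word set, punctuation dropped."""
    return set(re.findall(r"\w+", text.lower()))


def evaluate(
    retrieve: Callable[[str], str],
    read: Callable[[str, str], str],
    questions: Sequence[Question],
) -> Report:
    """Score a memory system on answer accuracy and mean lookup latency."""
    correct = 0
    latencies: List[float] = []
    for q in questions:
        start = time.perf_counter()
        evidence = retrieve(q.text)  # assumption: only the memory lookup is timed
        latencies.append(time.perf_counter() - start)
        prediction = read(q.text, evidence)
        # Assumption: substring match is a stand-in for the real scoring rule.
        if q.answer.lower() in prediction.lower():
            correct += 1
    return Report(correct / len(questions), mean(latencies))


# Toy "history" standing in for stored agent experience (invented data).
history = [
    "The export button is under the Reports tab.",
    "Ticket status changed from open to closed on the third visit.",
]


def keyword_memory(query: str) -> str:
    """Return the history line sharing the most words with the query."""
    return max(history, key=lambda line: len(tokens(query) & tokens(line)))


def echo_reader(question: str, evidence: str) -> str:
    """Trivial reader that answers with the evidence itself."""
    return evidence


questions = [
    Question("Where is the export button?", "Reports tab"),
    Question("What is the ticket status now?", "closed"),
    Question("Which tab holds billing settings?", "Account tab"),  # not in history
]

report = evaluate(keyword_memory, echo_reader, questions)
print(f"accuracy={report.accuracy:.2f} mean_latency_s={report.mean_latency_s:.6f}")
```

Keeping retrieval and answering as separate callables mirrors the setup the article describes, in which the memory system returns evidence and a downstream step produces the answer, so the same loop could in principle time a retrieval-based method and a slower coding-agent method side by side.
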
### Where can researchers use it next?

The GitHub repository says the release is public and includes scripts for download, preparation, validation, evaluation and leaderboard submission packaging. (arxiv.org) The project site describes the code, data and leaderboard workflow as forthcoming, while the repository already exposes the directory structure for evaluation, leaderboard and memory modules.

As of May 14, 2026, the preprint is available on arXiv under identifier 2605.12493, and the code is posted in the xiaowu0162/LongMemEval-V2 repository on GitHub. (arxiv.org) The next concrete step for outside teams is to run the public harness and submit results to the benchmark’s leaderboard tiers using the released packaging tools. (github.com)