MemoryBench team posts continual-learning benchmark
- MemoryBench, a continual-learning benchmark from Qingyao Ai and colleagues at Tsinghua, is on arXiv, with code on GitHub and a separate implementation documented by Supermemory.
- OpenReview shows that ICML 2026 program chairs rejected the submission on Jan 26, 2026; reviewers flagged heavy reliance on an LLM-as-user simulator and limited novelty.
- It uses 11 datasets to test declarative and procedural memory, revealing large gaps in LLM continual learning and making the case for standard evaluation tools.

Lede
LLM memory and continual learning concern how models retain facts and skills as they interact with users. This matters because real applications need models that learn from feedback rather than forgetting it or hallucinating more over time. Existing tests mostly check static recall on reading tasks, not live updates and retention. The MemoryBench paper and open-source code aim to change that by simulating user feedback and measuring both declarative facts and procedural skills.

What is MemoryBench?
MemoryBench is a benchmarking framework that simulates interactive users and runs sessions in which an LLM must store, recall, and adapt over time. It bundles generators, judges, and scoring to make memory experiments repeatable. The idea is to treat memory as a measurable product, not a fuzzy property.

What does it actually test?
The benchmark separates declarative memory (facts) from procedural memory (how to do tasks) and runs sequences of interactions across tasks and languages. The paper describes using 11 datasets to stress different failure modes: editing, long-horizon retention, and procedural adaptation.

Who wrote it and where is the code?
The paper lists Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, and Yiqun Liu as authors. They posted the paper on arXiv and maintain runnable code and experiment scripts on GitHub; another implementation and its docs live in Supermemory's MemoryBench pages.

Did it appear at ICML 2026?
No. The team submitted to ICML, but program chairs recorded a rejection on Jan 26, 2026. Reviewers praised the motivation but raised concerns about realism and novelty, especially the choice to simulate users with an LLM.

What did their experiments show?
Their runs found that off-the-shelf retrieval and memory strategies leave a lot of room for improvement: baselines struggled on procedural tasks and with noisy feedback.
The authors re-ran experiments with a unified retriever and swapped backbone models to test robustness, and the main ranking patterns stayed the same.

Why does this matter for builders?
If you build a memory layer or a RAG system, you want a repeatable way to measure forgetting and update quality. MemoryBench offers standardized sessions and metrics, which helps you compare providers or tune memory policies instead of eyeballing chat logs. That could shift how teams prioritize ongoing training versus architecture changes.

What's the catch?
The catch is realism. Reviewers worried that an LLM simulating a user will bias results toward architectures that mirror the simulator. That undermines one of the paper's goals: realistic human-in-the-loop evaluation. The authors pushed back with sensitivity tests, but the tension remains.

Bottom line
MemoryBench is a practical toolkit and dataset suite that makes continual-memory research easier to run and reproduce. It highlights clear weaknesses in current systems, even if its evaluation choices will keep sparking debate.
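
For readers who want a feel for the session loop described above (interact, receive simulated feedback, then re-test), here is a minimal Python sketch. It is not MemoryBench's API: `Task`, `DictMemoryAgent`, `simulated_user`, and `run_benchmark` are hypothetical names, the "simulator" is a plain string comparison standing in for an LLM, and the score is simple per-category accuracy chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One interaction: a prompt and the answer the simulated user expects."""
    prompt: str
    expected: str
    kind: str  # "declarative" (a fact) or "procedural" (how to do a task)


@dataclass
class DictMemoryAgent:
    """Toy agent (hypothetical): answers from a dict and stores corrections."""
    memory: dict = field(default_factory=dict)

    def answer(self, prompt: str) -> str:
        return self.memory.get(prompt, "unknown")

    def learn(self, prompt: str, correction: str) -> None:
        self.memory[prompt] = correction


def simulated_user(task: Task, reply: str):
    """Stand-in for an LLM-as-user simulator: returns a correction or None."""
    return None if reply == task.expected else task.expected


def run_benchmark(agent, tasks):
    """Pass 1: interact and collect feedback. Pass 2: re-ask to measure retention."""
    for task in tasks:
        feedback = simulated_user(task, agent.answer(task.prompt))
        if feedback is not None:
            agent.learn(task.prompt, feedback)

    totals, correct = {}, {}
    for task in tasks:
        totals[task.kind] = totals.get(task.kind, 0) + 1
        if agent.answer(task.prompt) == task.expected:
            correct[task.kind] = correct.get(task.kind, 0) + 1
    # Report accuracy separately for each memory type.
    return {kind: correct.get(kind, 0) / n for kind, n in totals.items()}


tasks = [
    Task("capital of France?", "Paris", "declarative"),
    Task("preferred date format?", "YYYY-MM-DD", "procedural"),
]
scores = run_benchmark(DictMemoryAgent(), tasks)
print(scores)
```

This toy agent retains every correction, so it scores perfectly; an agent whose `learn` does nothing would score zero in both categories. Real systems fall between those extremes, and that gap is what a benchmark of this kind is meant to measure.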