MemPalace tops LongMemEval

A posted result shows MemPalace scoring 96.6% on the LongMemEval benchmark for structured, persistent memory. The post claims an edge over typical RAG setups from combining a knowledge graph with verbatim storage, and frames the score as evidence that structured, long-term memory representations perform better for agents. (x.com/socialwithaayan/status/2043219929114313099)

Artificial intelligence assistants are being tested on a simple problem: can they remember details from old chats after the window closes? A benchmark called LongMemEval was built to measure that across 500 questions and five memory skills. (arxiv.org) The benchmark asks systems to recover facts across multiple sessions, track time, handle updated information, and know when not to answer. The LongMemEval authors said commercial assistants and long-context models showed a 30% accuracy drop on sustained-interaction memory tasks. (arxiv.org)

Into that test, MemPalace has posted a 96.6% recall-at-5 score in what it calls "raw mode," meaning it stores conversations verbatim and retrieves them without external API calls. The project's public benchmark page says that score came on all 500 LongMemEval questions. (github.com)

MemPalace's core claim is that memory works better when a system keeps the original exchange instead of first compressing it into a short fact. Its repository says the storage layer uses ChromaDB for verbatim search and SQLite for a temporal knowledge graph that tracks relationships over time. (github.com)

That design sits in the middle of a live argument in agent engineering. The LongMemEval paper said "extracting user facts for indexing improves both memory recall and downstream question answering," while MemPalace argues that summarizing first can discard the reasoning a later query needs. (arxiv.org, github.com)

The posted score measures retrieval quality, not whether a chatbot gives the best final answer in a real product. MemPalace's own benchmark notes label the 96.6% figure as recall-at-5 on LongMemEval and separate it from higher scores that use reranking with a language model. (github.com)

The LongMemEval maintainers have also updated the benchmark since launch. Their GitHub repository says the benchmark was first released in October 2024 and accepted to the International Conference on Learning Representations in February 2025, and that a cleaned version of the history sessions followed in September 2025 to reduce interference with answer correctness. (github.com)

MemPalace's result has also drawn criticism from developers examining the repo in public. One GitHub discussion said the 96.6% number is reproducible but argued that LongMemEval is an end-to-end question-answering benchmark, while an issue in the same repository said the headline score does not exercise every MemPalace-specific feature. (github.com, github.com)

Those objections do not erase the posted score, but they narrow what it shows. Right now, the cleanest reading is that a verbatim-heavy memory store paired with structured metadata can retrieve old chat details very well on LongMemEval, while the wider debate over what counts as "memory" in a production agent is still being fought in public. (github.com, arxiv.org)
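
For readers unfamiliar with the metric, the recall-at-5 figure discussed above can be illustrated with a minimal Python sketch. It assumes a question counts as a hit when at least one gold evidence session appears among the top five retrieved items; the exact scoring rules used by LongMemEval and MemPalace may differ. The function name, session IDs, and toy data are hypothetical and are not taken from either project.

```python
from typing import Sequence, Set


def recall_at_k(
    retrieved: Sequence[Sequence[str]],
    gold: Sequence[Set[str]],
    k: int = 5,
) -> float:
    """Fraction of questions with at least one gold item in the top-k results.

    Assumption: an "any hit" definition of recall-at-k, chosen for illustration.
    """
    if len(retrieved) != len(gold):
        raise ValueError("retrieved and gold must have the same length")
    if not gold:
        raise ValueError("need at least one question")
    hits = sum(
        1
        for ranked, relevant in zip(retrieved, gold)
        if any(item in relevant for item in ranked[:k])
    )
    return hits / len(gold)


# Hypothetical toy data: three questions with made-up session IDs.
retrieved = [
    ["s3", "s9", "s1", "s4", "s7", "s2"],  # gold "s1" is ranked 3rd: hit
    ["s5", "s6", "s8", "s0", "s4", "s2"],  # gold "s2" is ranked 6th: miss at k=5
    ["s2", "s1"],                          # gold "s2" is ranked 1st: hit
]
gold = [{"s1"}, {"s2"}, {"s2", "s9"}]

score = recall_at_k(retrieved, gold, k=5)
print(f"recall@5 = {score:.3f}")  # prints "recall@5 = 0.667"
```

Under this definition, a 96.6% score on 500 questions would correspond to 483 questions with a relevant session in the top five. The metric says nothing about whether the final generated answer is correct, which is the distinction the critics draw.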
