MemPalace nails long‑memory benchmark

MemPalace, an open-source agentic memory tool co-developed by Ben Sigman and Milla Jovovich, reports a 100% score on the LongMemEval benchmark with an optional reranking step, ahead of previously published results, and is available on GitHub for contributors. The project positions itself as a ready-made building block for resilient agent memory in side projects (x.com) (x.com).

# MemPalace nails long-memory benchmark

A new open-source project called MemPalace says it hit a perfect score on one of the best-known public tests for long-term artificial intelligence memory, and the claim spread quickly because the repository is tied to an unexpected pairing: developer Ben Sigman and actor Milla Jovovich. The GitHub project describes itself as “the highest-scoring AI memory system ever benchmarked,” and its release page says it reached 96.6% recall at 5 on LongMemEval with zero application programming interface calls, and 100% with an optional reranking step. (github.com 1) (github.com 2)

To understand the claim, it helps to know what “memory” means in this corner of artificial intelligence. Most large language model assistants are good at the last few turns of a conversation, but they often lose track of details from older chats, like a user’s project decisions, preferences, or earlier bug reports, once those details fall outside the model’s working context window. (github.com)

That limitation has created a small industry of memory layers for agents. Instead of trusting the model to remember everything, developers store past conversation fragments in a database and try to retrieve the right snippets later, the way a search engine pulls up old emails or documents when you type a query. (github.com 1) (github.com 2)

LongMemEval is one of the benchmarks built to test whether those systems actually work. The benchmark’s public repository says it contains 500 questions designed to measure five long-term memory abilities: information extraction, multi-session reasoning, knowledge updates, temporal reasoning, and abstention. (github.com)

The setup is harder than a simple “needle in a haystack” test because the assistant is supposed to answer after many prior chat sessions, not from a single clean note.
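
The store-then-retrieve idea behind these memory layers can be illustrated with a small sketch. It is a toy under stated assumptions: the `ToyMemory` class, its method names, and the bag-of-words cosine similarity are all hypothetical stand-ins chosen for illustration, not MemPalace's code or any vector database's API. Real systems typically rank with a learned embedding model instead of raw word overlap.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyMemory:
    """Hypothetical store: keeps full session text and ranks by word overlap."""

    def __init__(self):
        self.sessions = {}  # session_id -> (text, bag of words)

    def add(self, session_id, text):
        self.sessions[session_id] = (text, Counter(tokenize(text)))

    def search(self, query, k=5):
        """Return the ids of the k sessions most similar to the query."""
        q = Counter(tokenize(query))
        ranked = sorted(
            self.sessions,
            key=lambda sid: cosine(q, self.sessions[sid][1]),
            reverse=True,
        )
        return ranked[:k]


memory = ToyMemory()
memory.add("s1", "We decided to use PostgreSQL for the billing service.")
memory.add("s2", "The user prefers dark mode and tabs over spaces.")
memory.add("s3", "Bug report: login page crashes on empty password.")

top = memory.search("Which database did we pick for billing?", k=1)
print(top)  # ['s1']
```

The retrieved session text would then be placed in the model's prompt, so the model answers from the stored history instead of relying on its own context window.
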
In practice, the system has to find which earlier session matters, pull the right detail from it, and avoid mixing it up with similar facts that appeared at other times. (github.com)

MemPalace’s pitch is that storing everything is better than letting a model decide what is worth saving. The project README says other systems compress a conversation into a few extracted facts, while MemPalace keeps the full text and organizes it into a spatial structure modeled on the ancient “memory palace” idea of placing information in imagined rooms. (github.com)

In MemPalace’s description, conversations are arranged into “wings,” “halls,” and “rooms,” so a model can search memory more like walking through a building than scanning a flat pile of notes. The repository also claims a separate shorthand format called AAAK can compress text by 30 times with zero information loss so a model can load much more history at once. (github.com 1) (github.com 2)

The project is not just a benchmark claim on a social post. It is live on GitHub under an MIT license, includes benchmark scripts in the repository, lists Python 3.9 and above as a requirement, and invites outside contributors through a public CONTRIBUTING file and issue tracker. (github.com) (github.com)

That makes the story more interesting than a closed-source leaderboard boast, because anyone can inspect the code and test the benchmark path themselves. The LongMemEval repository is also public, which means outside developers can compare MemPalace’s setup against the benchmark’s own data and evaluation flow instead of taking the headline on faith. (github.com) (github.com)

And that is where the story gets messier. A public GitHub issue filed on April 8, 2026 argues that MemPalace’s headline 96.6% result does not actually test MemPalace-specific logic in its main “raw” mode, and instead relies on straightforward Chroma database retrieval with a standard embedding model.
(github.com)

That issue goes further and claims the parts that appear most specific to MemPalace, including the “rooms” mode and the AAAK compression mode, scored lower than the plain retrieval baseline on LongMemEval. According to the issue text, “rooms” scored 89.4% recall at 5 and AAAK scored 84.2%, while the headline “raw” score of 96.6% did not import most of the MemPalace library at all. (github.com)

The same issue also challenges the benchmark framing itself. It says the benchmark script creates a fresh vector database collection for each question from only about 50 sessions, which turns the task into finding the right session in the top 5 out of roughly 50 candidates, a much narrower problem than remembering across one giant persistent life log. (github.com)

None of that automatically disproves the release’s strongest claim, which is the optional 100% result with reranking, but it does change how the result should be read. A perfect score on a public benchmark can mean “best published pipeline on this exact setup” without necessarily meaning “best real-world memory architecture for autonomous agents.” (github.com) (github.com)

There is also a timing problem that often appears in fast-moving artificial intelligence launches. MemPalace’s release page says version 3.0.0 was published about 19 hours before capture, while the repository itself shows only a handful of commits over roughly two days, so the public conversation is moving faster than the normal independent replication cycle. (github.com) (github.com)

Even with those caveats, the project has clearly hit a nerve. The GitHub repository showed thousands of stars within about a day of release, which usually signals that developers are hungry for practical memory components they can plug into coding agents, research assistants, and side projects without paying for a hosted service.
(github.com) (github.com)

So the cleanest version of the story is this: MemPalace is a newly released open-source memory project that publicly claims 96.6% and 100% LongMemEval results, ships code that others can inspect, and presents itself as a local-first building block for agent memory. At the same time, public scrutiny has already started, and some of the most detailed criticism says the benchmark headline may reflect a strong retrieval baseline more than the project’s signature “memory palace” machinery. (github.com) (github.com) (github.com)
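
For readers who want the recall-at-5 numbers and the roughly 50-session framing made concrete, here is a minimal sketch. It rests on stated assumptions: each question has its own ranked list of candidate session ids, a question counts as a hit if any gold session appears in the top 5, and the session ids and results are invented for illustration. It does not reproduce MemPalace's or LongMemEval's actual evaluation scripts. The `random_baseline` helper shows the chance level when a single relevant session sits among 50 candidates.

```python
def recall_at_k(ranked_ids, gold_ids, k=5):
    """Return 1.0 if any gold session appears in the top k results, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(gold_ids) else 0.0


def mean_recall_at_k(results, k=5):
    """Average recall at k over (ranked_ids, gold_ids) pairs, one per question."""
    return sum(recall_at_k(r, g, k) for r, g in results) / len(results)


def random_baseline(num_candidates, k=5):
    """Chance that a single gold session lands in the top k of a random ranking."""
    return min(k, num_candidates) / num_candidates


# Hypothetical results for three questions, each ranked over its own candidate pool.
results = [
    (["s7", "s2", "s9", "s4", "s1"], ["s2"]),   # hit at rank 2
    (["s3", "s8", "s5", "s6", "s0"], ["s11"]),  # miss
    (["s4", "s1", "s2", "s3", "s5"], ["s5"]),   # hit at rank 5
]

print(round(mean_recall_at_k(results), 3))  # 0.667
print(random_baseline(50))                  # 0.1
```

Under these assumptions a random ranking already scores 10% on a 50-candidate pool, which is why the size of the pool matters when reading a recall-at-5 figure.
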
