Memento open‑sourced for context control
Microsoft Research published Memento, a system that teaches models to self‑manage their context windows so they can handle long or changing conversation state without external orchestration. The release includes technical write‑ups and code, signalling a push toward models that internalize session management rather than relying purely on prompt replay. That could change how teams design session state and checkpointing in retrieval‑augmented or agentic systems. (x.com)
Large language models can now spend hundreds of thousands of tokens thinking through one hard math or coding problem, which Microsoft Research says can make a single answer as long as a book. Every token stays in working memory unless something throws it away. (microsoft.com) That working memory is called the key-value cache, and it is the running scratchpad a model uses to keep track of what it already wrote. The longer the scratchpad gets, the more memory and compute each next token costs. (microsoft.com)
Most teams handle that problem from the outside with summarizers, prompt trimming, or fresh API calls that replay a shorter history. Microsoft’s pitch is to teach the model itself when to keep notes and when to forget the full draft. (microsoft.com)
Memento does that by having the model split its reasoning into blocks, like a student solving one part of an exam before moving to the next page. At the end of each block, the model writes a short “memento” that keeps the conclusions, formulas, and intermediate values it still needs. (microsoft.com) After that note is written, the earlier block is masked from attention and its cache entries are flushed, so the model stops carrying the full text of its old reasoning. It moves forward with the compact note plus the block it is currently working on. (microsoft.com)
Microsoft says standard supervised fine-tuning on about 30,000 examples was enough to teach this behavior. In its April 8, 2026 write-up, the team reported that peak key-value cache usage fell by a factor of 2 to 3 and that serving throughput nearly doubled. (microsoft.com)
The strange result is that deleting the old block does not fully delete its influence. The researchers say erased reasoning still leaves traces inside the cache representations, creating an “implicit second channel” that the model keeps using. (microsoft.com) That detail matters because it means Memento is not just summarization with a new name. The model is learning a rhythm of think, compress, and continue, while some information from the discarded text still survives in hidden form. (microsoft.com)
Microsoft released the paper along with OpenMementos, a dataset of 228,000 annotated traces built on OpenThoughts-v3, the data-generation pipeline, and a fork of the vLLM serving engine with native block masking. The GitHub repository went public about two days before April 10, 2026. (github.com) (microsoft.com)
If this idea holds up outside math and coding, it could change how retrieval-augmented generation systems store session state. Instead of replaying a whole chat log every time, a system could keep a chain of model-written checkpoints and let the model manage more of its own memory budget. (microsoft.com)
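To make the think, compress, and continue rhythm concrete, here is a minimal Python sketch of the token bookkeeping only. It is not Microsoft's code and does not use the Memento repository or vLLM. The names `Block`, `peak_cache_full`, and `peak_cache_memento` are hypothetical, and the token counts are made-up illustrative values, not figures from the write-up. It assumes a block and its memento are both live while the memento is being written, and that only the block's reasoning tokens are evicted afterward.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """One reasoning block (hypothetical structure, for illustration only)."""
    reasoning_tokens: int  # tokens spent thinking inside the block
    memento_tokens: int    # tokens in the short note written at the block's end


def peak_cache_full(prompt_tokens: int, blocks: List[Block]) -> int:
    """Peak live tokens when nothing is ever evicted from the cache."""
    return prompt_tokens + sum(b.reasoning_tokens + b.memento_tokens for b in blocks)


def peak_cache_memento(prompt_tokens: int, blocks: List[Block]) -> int:
    """Peak live tokens when each block is evicted once its memento is written."""
    live = prompt_tokens
    peak = live
    for b in blocks:
        # The block and its memento are both in the cache while the note is written.
        live += b.reasoning_tokens + b.memento_tokens
        peak = max(peak, live)
        # After the note exists, only the note is kept; the block's tokens are dropped.
        live -= b.reasoning_tokens
    return peak


if __name__ == "__main__":
    # Illustrative numbers only: a 200-token prompt and four equal blocks.
    blocks = [Block(reasoning_tokens=1000, memento_tokens=100) for _ in range(4)]
    print("no eviction:", peak_cache_full(200, blocks))
    print("with mementos:", peak_cache_memento(200, blocks))
```

The sketch counts tokens and nothing else. It cannot show the “implicit second channel” the researchers describe, because that effect lives in the cached representations of the mementos rather than in their length.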