MEMENTO reduces KV cache

Microsoft’s MEMENTO technique compresses LLM reasoning traces into compact “mementos,” cutting key-value (KV) cache size by roughly 2–2.5× while keeping accuracy close to standard reasoning runs. It relies on a dual information stream in KV entries. (x.com) The approach promises lower memory costs for long-context or reasoning-heavy workloads. (x.com)

Large language models keep a running memory of every generated token, and that memory can swamp graphics-card limits on long answers. Microsoft researchers say a new method called MEMENTO cuts that key-value cache by about 2 to 2.5 times while keeping benchmark accuracy close to standard reasoning runs. (arxiv.org) (microsoft.com)

That memory, called the key-value cache, stores the internal attention states a model reuses for each next token instead of recomputing them from scratch. As prompts and reasoning traces get longer, the cache grows token by token and becomes a major limit on inference cost, throughput, and maximum usable context. (huggingface.co) (arxiv.org)

MEMENTO changes the pattern by teaching the model to write in blocks, then replace each finished block with a short summary the paper calls a memento. After that summary is produced, the earlier block is masked from attention and its key-value entries are flushed from memory. (arxiv.org) (microsoft.com) (github.com)

The Microsoft team reported about 2.5 times lower peak key-value cache use and about 1.75 times higher throughput in its vLLM-based inference setup. The paper says the method held up across Qwen3, Phi-4, and Olmo 3 models ranging from 8 billion to 32 billion parameters on math, science, and coding benchmarks. (arxiv.org) (github.com)

The training recipe adds supervision for where a model should pause and what compact state it should keep. Microsoft said it released OPENMEMENTOS, a dataset of 228,000 reasoning traces derived from OpenThoughts-v3, with block boundaries and intermediate summaries for that process. (arxiv.org) (github.com)

The paper describes each cache entry as carrying two streams of information: exact local detail for the current block and compressed state for older work. That lets the model keep working from a shorter running context instead of dragging the full reasoning trace forward token by token. (arxiv.org) (microsoft.com)

This arrives as model providers push longer context windows and heavier reasoning, both of which increase memory pressure during generation. Reviews of key-value cache optimization have described that cache as a central inference bottleneck, and MEMENTO takes a different route from token pruning or low-rank compression by changing what the model writes into memory in the first place. (arxiv.org) (fin.ai) (openreview.net)

Microsoft published the paper and code in April 2026, along with a blog post explaining the method and a repository with data and vLLM extensions. The next test is whether model makers adopt the training recipe widely enough for long-context and reasoning-heavy deployments to treat compact summaries, not full traces, as the default memory format. (arxiv.org) (microsoft.com) (github.com)
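
To make the block-and-flush idea described above concrete, here is a toy Python sketch that counts how many token positions sit in the key-value cache at the peak, with and without eviction of finished blocks. It is not Microsoft's implementation: the function name `peak_kv_tokens`, the prompt length, the block lengths, and the memento length are all hypothetical, and the model ignores layers, attention heads, and numeric precision. It only illustrates why dropping a finished block after its memento is written keeps the peak low.

```python
def peak_kv_tokens(prompt_len, block_lens, memento_len=None):
    """Return the peak number of token positions held in the KV cache.

    Toy model (an assumption for illustration, not the paper's code):
    - memento_len=None: ordinary generation, every token stays cached.
    - memento_len=k: each block is followed by a k-token memento, after
      which the block's entries are evicted and only the memento is kept.
    """
    cached = prompt_len
    peak = cached
    for block_len in block_lens:
        if memento_len is None:
            cached += block_len                 # baseline: nothing is ever evicted
            peak = max(peak, cached)
        else:
            cached += block_len + memento_len   # block, then its memento
            peak = max(peak, cached)            # peak is reached just before the flush
            cached -= block_len                 # flush the finished block, keep the memento
    return peak


# Hypothetical workload: a 100-token prompt and five 400-token reasoning blocks.
blocks = [400] * 5
baseline = peak_kv_tokens(100, blocks)                    # 2100
compacted = peak_kv_tokens(100, blocks, memento_len=40)   # 700
print(baseline, compacted, baseline / compacted)
```

The 3× ratio here is an artifact of the made-up block and memento sizes, not a reproduction of the reported results; the reported reduction depends on how the trained model chooses its block boundaries and summary lengths.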

Shared from Scout - Be the smartest in the room.