EverMind claims to break the 100M-token context barrier
EverMind announced an MSA (Memory Sparse Attention) architecture that it says enables efficient end-to-end long-term memory and breaks the 100M-token context barrier, opening up new long-history RAG and session-persistence use cases. If validated, that claim would change how enterprise search handles massive conversation histories. (prnewswire.com)
EverMind posted its research paper on March 18, 2026, and followed with a press release from San Mateo on March 19. (prnewswire.com) The project's GitHub repo hosts an MSA implementation and a README that claims near-linear O(L) complexity and under 9% performance degradation when scaling from 16K to 100M token contexts. (github.com)
Core engineering components listed in the repo include top-k selection fused with sparse attention, document-wise RoPE (parallel/global), KV-cache compression, a Memory Parallel inference engine, and a Memory Interleave mechanism for multi-round, multi-hop reasoning. (github.com) EverMind states that the Memory Parallel design keeps routing keys resident on the GPU and content K/V resident on the CPU, transferring content on demand to enable extreme-scale inference on a 2×NVIDIA A800 setup. (github.com)
The paper evaluates MSA on long-context QA benchmarks and a Needle-In-A-Haystack (NIAH) workload, with context sweeps reported up to 1M tokens, and the repo claims results that surpass same-backbone RAG stacks and leading long-context models. (prnewswire.com) EverMind frames MSA as a trainable, end-to-end latent-state memory framework intended as a “memory plug-in” for models, with differentiability and scalability as design goals. (github.com) Some outlets link EverMind to Shanda Group’s larger “Discoverative AI” strategy, and both the repo and the paper are publicly available for inspection on GitHub. (ai-watch.jp)
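The routing idea described for the repo (small routing keys kept in fast memory, top-k selection, and content K/V fetched only for the selected documents) can be illustrated with a toy sketch. This is not EverMind's code: the function names, the dictionary standing in for CPU-resident storage, and the single-query, single-head setup are all assumptions made for illustration, and the real system reportedly fuses these steps on the GPU.

```python
import heapq
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_docs(query, routing_keys, k):
    """Return ids of the k documents whose routing key scores highest.

    routing_keys is a hypothetical stand-in for the small, always-resident
    (GPU-side) index: one vector per document.
    """
    scored = ((dot(query, key), doc_id) for doc_id, key in routing_keys.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


def sparse_attention(query, routing_keys, content_store, k):
    """Attend only over the K/V entries of the top-k routed documents.

    content_store is a hypothetical stand-in for the large (CPU-side) store;
    the lookup below plays the role of an on-demand transfer.
    """
    selected = top_k_docs(query, routing_keys, k)
    keys, values = [], []
    for doc_id in selected:
        doc_keys, doc_values = content_store[doc_id]  # fetch only what was routed
        keys.extend(doc_keys)
        values.extend(doc_values)

    # Scaled dot-product attention with a numerically stable softmax.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, key) * scale for key in keys]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)
    ]
    return selected, output


if __name__ == "__main__":
    # Toy data: three documents, one key/value pair each.
    routing_keys = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [0.9, 0.1]}
    content_store = {
        "doc_a": ([[1.0, 0.0]], [[1.0, 0.0]]),
        "doc_b": ([[0.0, 1.0]], [[0.0, 1.0]]),
        "doc_c": ([[1.0, 0.0]], [[0.0, 2.0]]),
    }
    print(sparse_attention([1.0, 0.0], routing_keys, content_store, k=2))
```

In this sketch the attention step touches only the K/V entries of the k selected documents, while the routing step still scans one small key per document. How MSA itself organizes and accelerates that routing is described in the repo and paper, not here.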