EverMind claims to break 100M tokens
EverMind claims its MSA (Memory Sparse Attention) architecture breaks the 100‑million‑token barrier, enabling efficient long‑term memory for LLMs. That would be a potential game changer for stateful AI services and personalized agent backends. If validated, memory‑efficient long contexts could reshape how we design session state and caching for on‑device and cloud hybrids. (prnewswire.com)
EverMind released the MSA paper and an accompanying press release on March 18–19, 2026, from San Mateo, Calif., with the announcement distributed via PRNewswire. (prnewswire.com) The MSA design combines top‑k latent routing with a scalable sparse‑attention layer, document‑wise RoPE, KV‑cache compression, a Memory Interleave control loop, and a Memory Parallel inference engine. (github.com/EverMind-AI/MSA)
EverMind’s repo and README claim O(L) (linear) inference complexity and report less than 9% quality degradation when extrapolating from 16K to 100M token contexts on the evaluated benchmarks. (github.com/EverMind-AI/MSA) The team describes a tiered storage pipeline with GPU‑resident routing keys, CPU‑resident compressed K/V stores, distributed scoring, and on‑demand transfers, which the README says makes 100M‑token inference feasible on two NVIDIA A800 GPUs. (github.com/EverMind-AI/MSA) MSA’s evaluations claim wins over both same‑backbone RAG and best‑of‑breed RAG stacks on long‑context QA and Needle‑In‑A‑Haystack (NIAH) tasks, and the benchmark suite explicitly includes MS MARCO. (github.com/EverMind-AI/MSA)
EverMind positions MSA alongside its EverMemOS work (a product iteration and a developer competition announced Feb. 3, 2026) and describes itself as an incubation project under Shanda Group, which was founded by Tianqiao Chen. (prnewswire.com; shanda.com)
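To make the routing idea described above concrete, here is a minimal sketch of top‑k routing followed by attention over only the selected documents. It is an illustration under stated assumptions, not EverMind's implementation: the function names (top_k_route, sparse_attention), the dot‑product scoring, and the dictionary standing in for a CPU‑resident K/V store are all hypothetical and do not come from the MSA repo.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(query, routing_keys, k):
    """Return the ids of the k documents whose routing key best matches the query.

    Assumption: one small routing key per document, scored by dot product.
    """
    scores = [dot(query, key) for key in routing_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(query, routing_keys, kv_store, k):
    """Attend only over the tokens of the top-k routed documents."""
    selected = top_k_route(query, routing_keys, k)
    keys, values = [], []
    for doc_id in selected:
        # Hypothetical stand-in for an on-demand fetch from a slower tier.
        doc_keys, doc_values = kv_store[doc_id]
        keys.extend(doc_keys)
        values.extend(doc_values)
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, key) * scale for key in keys])
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return selected, output


# Toy memory: three one-token "documents" with 2-dimensional keys and values.
routing_keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
kv_store = {
    0: ([[1.0, 0.0]], [[1.0, 0.0]]),
    1: ([[0.0, 1.0]], [[0.0, 1.0]]),
    2: ([[0.5, 0.5]], [[0.5, 0.5]]),
}
selected, output = sparse_attention([1.0, 0.1], routing_keys, kv_store, k=2)
print(selected, output)
```

In this sketch the attention step touches only the tokens of the k selected documents, so its cost depends on k rather than on the total memory size. The routing step here is still a linear scan over per‑document keys, which is cheap only because each key is small. How MSA actually scores, compresses, and transfers keys is described in its README and is not reproduced here.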