Layer attention breakthrough
Researchers are pitching 'Attention Residuals' — using softmax attention across layers (and 'block' attention to keep compute feasible) so deep models can selectively retrieve earlier layer information instead of drowning it out. (youtube.com)
The Kimi Team, led by Guangyu Chen and listing 36 authors, submitted "Attention Residuals" to arXiv on March 16, 2026 (arXiv:2603.15031). Moonshot AI's official GitHub repo hosts the paper PDF and reference code, and showed roughly 1.3k stars and 67 forks at the time of publication (github.com). The authors report integrating their method into a "Kimi Linear" setup with 48 billion total parameters and 3 billion activated parameters, pre-trained on 1.4 trillion tokens (arxiv.org). Benchmark results listed in the repo show MMLU rising from 73.5 to 74.6, GPQA‑Diamond from 36.9 to 44.4 (+7.5), BBH from 76.3 to 78.0, and HumanEval from 59.1 to 62.2, with the largest gains on multi‑step reasoning and code tasks (github.com). The paper and README describe production-oriented optimizations that make the approach practical for large distributed training: a two‑phase computation schedule, cache‑based pipeline communication, and RMSNorm to control layer-output magnitudes. A PyTorch RFC proposing an nn.BlockAttentionResidual operator was opened on March 16, 2026; it explicitly references the paper and calls out infrastructure needs, such as online softmax merging and cross‑stage caching, for efficient core integration (github.com).
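To make the idea concrete, here is a minimal pure-Python sketch of softmax attention over earlier layer outputs, plus the generic online-softmax merge that the RFC mentions as an infrastructure need. It is an illustration under stated assumptions, not the paper's or the RFC's implementation: the per-layer query vector, the use of RMS-normalized outputs as keys, and all function names (`rms_norm`, `partial_attention`, `merge`, `finalize`, `layer_attention`) are hypothetical.

```python
import math

def rms_norm(v, eps=1e-6):
    """Scale a vector to unit root-mean-square (no learned gain in this sketch)."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]

def partial_attention(query, values):
    """Attend over one chunk of earlier layer outputs.

    Assumption: keys are the RMS-normalized outputs and `query` is a
    per-layer vector. Returns (max_logit, sum_of_exps, unnormalized_sum)
    so chunks can be merged later without revisiting them.
    """
    logits = [sum(q * k for q, k in zip(query, rms_norm(v))) for v in values]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    acc = [sum(e * v[d] for e, v in zip(exps, values)) for d in range(len(query))]
    return m, sum(exps), acc

def merge(a, b):
    """Online-softmax merge: rescale both partial results to a shared max."""
    m = max(a[0], b[0])
    scale_a, scale_b = math.exp(a[0] - m), math.exp(b[0] - m)
    total = a[1] * scale_a + b[1] * scale_b
    acc = [x * scale_a + y * scale_b for x, y in zip(a[2], b[2])]
    return m, total, acc

def finalize(state):
    """Divide the accumulated sum by the softmax denominator."""
    _, total, acc = state
    return [x / total for x in acc]

def layer_attention(query, values):
    """Softmax-weighted mix of all earlier layer outputs in one pass."""
    return finalize(partial_attention(query, values))

# Toy data: four earlier layer outputs of width 3 and one hypothetical query.
outputs = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [3.0, 1.0, -2.0], [0.0, 2.0, 1.0]]
query = [0.2, -0.4, 0.9]

full = layer_attention(query, outputs)
merged = finalize(merge(partial_attention(query, outputs[:2]),
                        partial_attention(query, outputs[2:])))
```

In this sketch, `full` and `merged` agree up to floating-point error. That is the property that lets partial results computed in separate chunks (for example, on different pipeline stages) be cached and combined later instead of recomputed.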