Apple's Mamba distillation
- An Apple paper described a two‑stage distillation from Transformers into Mamba SSMs for linear‑time inference.
- The paper reports near‑teacher perplexity on 1B‑parameter models using the distillation technique.
- The approach promises long‑context, lower‑compute inference patterns that could change how open models are served. (x.com)

Most language models still use attention, which compares each token with every earlier token, so its cost grows quadratically as prompts get longer. Apple researchers say they can distill a 1-billion-parameter Transformer into a Mamba model that processes sequences in linear time. (arxiv.org)

The paper, “Attention to Mamba: A Recipe for Cross-Architecture Distillation,” was posted on arXiv on April 1, 2026 by Abhinav Moudgil, Ningyuan Huang, Eeshan Gunesh Dhekane, Pau Rodríguez, Luca Zappella, and Federico Danieli, with Apple listed as the primary affiliation. (arxiv.org)

Their method uses two stages instead of a direct conversion. First, it distills a standard Transformer into a linearized attention model using an adapted kernel method; then it distills that intermediate model into a Mamba variant with no attention blocks. (arxiv.org)

That middle step matters because past work found that a straight Transformer-to-Mamba distillation usually loses too much quality. Apple’s paper says a Mamba student initialized this way keeps downstream performance close to the original Pythia-1B teacher. (arxiv.org)

On the paper’s headline metric, the distilled Mamba reached a perplexity of 14.11 versus 13.86 for the Pythia-1B teacher, a gap of 0.25, or roughly 1.8 percent (lower is better). The authors say they ran ablations at the 1-billion-parameter scale over 10 billion distillation tokens and tested how token allocation between the two stages changed results. (arxiv.org)

Mamba is a state space model, a sequence architecture that replaces the full attention matrix with a running internal state. Its appeal is lower memory use and higher generation throughput than attention-based models, especially as context windows get longer. (github.com, huggingface.co) The Mamba family was introduced by Albert Gu and Tri Dao as a linear-time alternative to Transformers, and the reference implementation describes it as built for efficient, hardware-aware sequence modeling.
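
To make the “running internal state” idea concrete, here is a minimal Python sketch. It is not Mamba itself: it is a toy scalar recurrence with fixed coefficients, whereas Mamba uses a much larger state with input-dependent parameters. The function names `ssm_scan` and `causal_attention_pairs` and the coefficients `a`, `b`, and `c` are illustrative assumptions, not taken from the paper or the reference implementation.

```python
from typing import List


def ssm_scan(xs: List[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Run a toy scalar state-space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t   (update the running state)
    y_t = c * h_t                 (read the output from the state)

    The state h is a single number here, so memory stays constant
    and the work per token does not depend on sequence length.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def causal_attention_pairs(n: int) -> int:
    """Count query-key scores that causal attention computes for n tokens.

    Token t attends to tokens 1..t, giving 1 + 2 + ... + n scores.
    """
    return n * (n + 1) // 2
```

For a 1,000-token prompt, `causal_attention_pairs(1000)` is 500,500 score computations, while `ssm_scan` performs 1,000 state updates. That difference in growth is what “linear-time” refers to.
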
Hugging Face’s documentation says the original Mamba paper reported up to 5 times higher inference throughput than Transformers and scaling to million-length sequences. (github.com, huggingface.co)

Apple’s bridge model draws on Hedgehog, a linear attention approach that tries to mimic softmax attention while keeping linear complexity. That gives the distillation pipeline an intermediate architecture that is closer to a Transformer before the final jump into Mamba. (arxiv.org, arxiv.org)

Apple has been publishing around Mamba for more than a year. A separate Apple Machine Learning Research post on “Understanding Input Selectivity in Mamba” says the company has also studied how Mamba’s selective state updates work inside the model itself. (machinelearning.apple.com)

The paper does not claim a new foundation model release. It makes a narrower argument: pretrained Transformer know-how can be reused to build attention-free Mamba students that stay close to the teacher, which would make long prompts cheaper to serve if the recipe holds up beyond the reported 1-billion-parameter setting. (arxiv.org)
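
One way to see why a linear-attention model is a plausible bridge toward Mamba is that causal linear attention can be computed either pairwise, like a Transformer, or from a fixed-size running state, like a recurrent model. The Python sketch below illustrates that equivalence under stated assumptions. It uses the simple fixed feature map elu(x) + 1 as a stand-in, whereas Hedgehog learns its feature map, and it is not the paper's pipeline. All function names here are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def feature_map(x: Vector) -> Vector:
    """Positive feature map elu(x) + 1 (a stand-in for a learned map)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a: Vector, b: Vector) -> float:
    return sum(p * q for p, q in zip(a, b))


def pairwise_linear_attention(qs: List[Vector], ks: List[Vector], vs: List[Vector]) -> List[Vector]:
    """Causal kernel attention computed pair by pair: quadratic in length."""
    out = []
    for t, q in enumerate(qs):
        fq = feature_map(q)
        weights = [dot(fq, feature_map(ks[s])) for s in range(t + 1)]
        total = sum(weights)
        out.append([
            sum(w * vs[s][j] for s, w in enumerate(weights)) / total
            for j in range(len(vs[0]))
        ])
    return out


def recurrent_linear_attention(qs: List[Vector], ks: List[Vector], vs: List[Vector]) -> List[Vector]:
    """Same outputs from a fixed-size running state: linear in length."""
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    norm = [0.0] * d_k                         # running sum of phi(k)
    out = []
    for q, k, v in zip(qs, ks, vs):
        fk = feature_map(k)
        for i in range(d_k):
            norm[i] += fk[i]
            for j in range(d_v):
                state[i][j] += fk[i] * v[j]
        fq = feature_map(q)
        denom = dot(fq, norm)
        out.append([
            sum(fq[i] * state[i][j] for i in range(d_k)) / denom
            for j in range(d_v)
        ])
    return out
```

Both functions return the same values up to floating-point rounding. The recurrent form keeps only `state` and `norm` between tokens, which is structurally closer to a state space model's running state than to a full attention matrix.
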