Apple paper on depth‑wise MoE appears

Apple published an arXiv paper on depth‑wise alignment for Mixture‑of‑Experts (MoE) models that reports gains from predictable expert prefetch, the same access pattern a Flash‑MoE‑style design relies on. The work suggests Apple is exploring model architectures tuned for predictable memory and compute patterns. (x.com)

Apple’s M2R2 paper, “Mixture of Multi‑Rate Residuals for Efficient Transformer Inference” by Nikhil Bhendawade, Mahyar Najibi, Devang Naik and Irina Belousova, appeared on arXiv as arXiv:2502.02040 and was presented at ICLR 2025. (arxiv.org) M2R2 dynamically modulates residual “velocity” so that intermediate representations align earlier in the network. It reports up to a 2.8× speedup on MT‑Bench in self‑speculative decoding, and a 2.9× speedup for MoE models when early residual alignment is paired with ahead‑of‑time expert loading into high‑bandwidth memory (HBM). (machinelearning.apple.com) According to the paper, loading experts ahead of time and overlapping those memory transfers with computation is what reduces expert‑switching bottlenecks in sparse MoE pipelines. (arxiv.org)

Separately, independent community work labeled Flash‑MoE demonstrated a streaming/flash approach that runs a 397B‑parameter Qwen variant on a Mac by packing experts, quantizing aggressively, and issuing predictable contiguous reads from external storage; the GitHub repo and writeups describe execution on Apple hardware. (github.com)

Apple’s MLX runtime and related research threads (MM1 multimodal MoE models, Omni‑Router shared routers, and the Apple Intelligence Foundation Language Models report, which includes a Parallel‑Track MoE design) trace a line from model architecture papers to hardware‑aware runtime work. (machinelearning.apple.com) On the adoption side, Ollama announced MLX integration on March 30, 2026, reporting preview speedups of roughly 1.6× for prefill and 1.9× for decode, which suggests real‑world demand for the predictable memory and expert‑loading patterns M2R2 targets. (ollama.com)
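The idea of loading experts ahead of time and overlapping the transfer with computation can be illustrated with a toy sketch. This is not code from the M2R2 paper, Flash‑MoE, or MLX: the functions `load_expert`, `compute_layer`, `run_sequential` and `run_prefetched` are hypothetical, the transfer and compute costs are simulated with `time.sleep`, and the sketch assumes the next expert is already known before the current layer finishes (the prediction step itself is not modeled).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_expert(expert_id, delay=0.05):
    """Hypothetical stand-in for copying one expert's weights into fast memory."""
    time.sleep(delay)  # simulated transfer latency
    return {"id": expert_id, "scale": expert_id + 1}


def compute_layer(x, expert, delay=0.05):
    """Hypothetical stand-in for running one MoE layer with a loaded expert."""
    time.sleep(delay)  # simulated compute latency
    return x * expert["scale"]


def run_sequential(x, expert_ids):
    """Baseline: each layer waits for its expert to load before computing."""
    for expert_id in expert_ids:
        expert = load_expert(expert_id)
        x = compute_layer(x, expert)
    return x


def run_prefetched(x, expert_ids):
    """Start loading the next expert while the current layer is computing.

    Assumes the expert sequence is known ahead of time.
    """
    if not expert_ids:
        return x
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_expert, expert_ids[0])
        for i in range(len(expert_ids)):
            expert = pending.result()  # blocks only if the transfer is unfinished
            if i + 1 < len(expert_ids):
                # Kick off the next transfer before computing the current layer.
                pending = pool.submit(load_expert, expert_ids[i + 1])
            x = compute_layer(x, expert)
    return x


if __name__ == "__main__":
    ids = [0, 1, 2, 3]
    for fn in (run_sequential, run_prefetched):
        start = time.perf_counter()
        result = fn(1, ids)
        print(fn.__name__, result, f"{time.perf_counter() - start:.2f}s")
```

Both functions compute the same result; the prefetched version hides all but the first transfer behind compute, so the benefit depends on how reliably the next expert can be known early, which is the role the paper assigns to early residual alignment.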
