Inference = CPU + Memory Story
Industry spending is pivoting from GPU‑centric training to inference, which is boosting demand for CPUs and memory‑optimized platforms. Memory bandwidth and efficient system design are now the chokepoints for local LLM workloads, which makes memory‑centric architecture and kernel optimizations strategic priorities for teams building on‑device AI. (indexbox.io, xda-developers.com)
MLPerf Inference v4.1, published Aug. 28, 2024, added generative and mixture‑of‑experts benchmarks and explicitly framed inference performance as driving rapid hardware and software innovation across vendors. (mlcommons.org)
An NVIDIA Research limit study (arXiv 2507.14397, published July 18, 2025) measured transformer inference across systems and concluded that memory bandwidth, memory capacity and synchronization overhead are the dominant bottlenecks for autoregressive LLM decoding. (arxiv.org)
Hands‑on testing reported by XDA Developers found that raising GPU memory clock and bus width improved local LLM token throughput far more than core‑clock boosts did, identifying VRAM capacity, memory speed and bus width as the primary throughput drivers for on‑device models. (xda-developers.com)
A Dell Technologies investigation running Llama 2 on a PowerEdge HS5610 with 4th‑Gen Intel Xeon processors documented repeated memory‑access stalls that directly reduced tokens per second during inference, tying the observed slowdowns to memory subsystem behavior. (infohub.delltechnologies.com, delltechnologies.com)
Industry coverage in March 2026 reported NVIDIA pursuing a large, inference‑focused chip (reported ~$20 billion in development/licensing activity), while analysts and trade press highlighted Intel and other CPU vendors as likely beneficiaries as inference deployments scale. (msn.com, forbes.com)
Apple’s March 5, 2025 M3 Ultra announcement specified unified memory configurations up to 512GB, and independent reporting measured roughly 800–819 GB/s of unified memory bandwidth on M3 Ultra systems. Those figures allow larger models to remain resident in memory for on‑device inference. (apple.com, macrumors.com)
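To see why the reports above treat memory bandwidth and capacity as the limiting factors, a common back-of-envelope estimate may help. It is a rough sketch, not a method taken from any of the cited sources, and it assumes a dense model decoded one token at a time (batch size 1) in which every weight is read from memory once per generated token. It ignores KV-cache traffic, compute time and mixture-of-experts sparsity. The function names, the 70-billion-parameter example model and the 20% overhead allowance are hypothetical choices made for illustration.

```python
def decode_tokens_per_second_ceiling(bandwidth_gb_s: float,
                                     params_billions: float,
                                     bytes_per_param: float) -> float:
    """Rough upper bound on single-stream decode rate.

    Assumption: each generated token requires reading all weights once,
    so throughput cannot exceed bandwidth divided by model size.
    """
    # Billions of parameters times bytes per parameter gives gigabytes.
    model_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / model_gb


def fits_in_memory(params_billions: float,
                   bytes_per_param: float,
                   memory_gb: float,
                   overhead_fraction: float = 0.2) -> bool:
    """Check whether the weights plus an assumed overhead fit in memory.

    overhead_fraction is a hypothetical allowance for KV cache,
    activations and runtime buffers; real overhead varies by workload.
    """
    model_gb = params_billions * bytes_per_param
    return model_gb * (1.0 + overhead_fraction) <= memory_gb


if __name__ == "__main__":
    bandwidth = 819.0  # GB/s, upper end of the reported M3 Ultra range
    memory = 512.0     # GB, largest reported unified memory configuration
    for label, bytes_per_param in (("16-bit", 2.0), ("4-bit", 0.5)):
        ceiling = decode_tokens_per_second_ceiling(bandwidth, 70.0, bytes_per_param)
        resident = fits_in_memory(70.0, bytes_per_param, memory)
        print(f"70B {label}: ceiling {ceiling:.1f} tokens/s, fits: {resident}")
```

Under these assumptions, a hypothetical 70B model with 4-bit weights (about 35 GB) has a ceiling of roughly 23 tokens per second at 819 GB/s, while the same model at 16 bits (about 140 GB) tops out near 6. Raising compute alone changes neither number, which is consistent with the memory-clock and bus-width findings described above.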