Local LLMs Show Performance Gaps on Apple Silicon
Real-world tests of local AI code assistants on Apple's M5 silicon reveal sluggish out-of-the-box performance, with speeds of just 5–10 tokens per second. Achieving optimal results requires significant engineering effort to tune memory bandwidth and GPU utilization. The findings underscore that raw hardware power alone is insufficient for peak on-device AI performance without deep, low-level software optimization.
- Apple's MLX framework is central to performance. It is designed specifically to exploit the unified memory architecture of Apple Silicon, letting the CPU and GPU access the same data without memory copies, a common bottleneck for frameworks not optimized for the hardware.
- The primary constraint on generating subsequent tokens in a response is memory bandwidth, not raw compute. For example, the M5 chip's 19–27% improvement over the M4 in token generation tracks closely with its ~28% higher memory bandwidth (153 GB/s vs 120 GB/s).
- Apple Silicon offers a significant advantage in memory capacity for the cost: a Mac Studio with 512GB of unified memory can run models larger than 400 billion parameters. High-end NVIDIA GPUs often still lead in raw tokens-per-second speed for smaller models, where their dedicated VRAM and mature CUDA ecosystem excel.
- Apple's dedicated Neural Engine is largely underutilized for general-purpose LLM inference, as it was designed for smaller, statically scheduled operations, not the more dynamic computations required by large transformer models.
- The key to running larger, more capable models is quantization, a technique that compresses model weights to a lower precision (e.g., 4-bit). This reduces the memory footprint and bandwidth requirements, allowing a 70-billion-parameter model that would otherwise be too large to fit in memory and run efficiently.
- Performance benchmarks distinguish between time-to-first-token (TTFT), which is compute-bound, and subsequent token generation, which is memory-bound. Optimizing the full user experience requires addressing both through deep software and hardware co-design.
- The unified memory architecture avoids the "PCIe bottleneck" seen in traditional PC builds, where data must be transferred between system RAM and the GPU's VRAM. That transfer can become a major performance limiter once a model's memory requirements exceed the dedicated VRAM.
- Apple's strategic focus on on-device AI for privacy, offline capability, and responsiveness makes optimizing local LLM performance a core business imperative. This aligns hardware advancements in Apple Silicon directly with system-level software features like Apple Intelligence.
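
The bandwidth and quantization points above can be made concrete with a rough back-of-envelope calculation. The sketch below is illustrative only: it assumes a dense model whose weights are all read once per generated token, and it ignores the KV cache, activations, and quantization metadata such as scales. The function names are hypothetical and do not come from MLX or any other library; the bandwidth figures are the ones quoted above.

```python
# Back-of-envelope estimate, not a benchmark. Assumptions (all simplifications):
#   - dense model: every weight is read from memory once per generated token
#   - KV cache, activations, and quantization scales are ignored
#   - 1 GB = 1e9 bytes

def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB."""
    return n_params * bits_per_weight / 8 / 1e9


def decode_tokens_per_sec_ceiling(bandwidth_gb_s: float, footprint_gb: float) -> float:
    """Upper bound on memory-bound decode speed: bandwidth / bytes read per token."""
    return bandwidth_gb_s / footprint_gb


# Memory bandwidth figures quoted in the article (GB/s).
BANDWIDTH_GB_S = {"M4": 120.0, "M5": 153.0}

if __name__ == "__main__":
    n_params = 70e9  # the 70-billion-parameter example from the article
    for bits in (16, 4):
        size = weight_footprint_gb(n_params, bits)
        print(f"{bits}-bit weights: ~{size:.0f} GB")
        for chip, bw in BANDWIDTH_GB_S.items():
            ceiling = decode_tokens_per_sec_ceiling(bw, size)
            print(f"  {chip}: at most ~{ceiling:.1f} tokens/s")
```

Under these assumptions, 4-bit quantization shrinks the 70-billion-parameter weights from about 140 GB to about 35 GB, and the decode ceiling scales with the bandwidth ratio (153 / 120 ≈ 1.28), which is the same relationship the M5 versus M4 comparison points to. Real throughput will be lower than this ceiling.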