Local LLM Performance Varies on Apple Silicon
Running large language models locally on Apple's M5 chips reveals significant performance differences depending on the software stack used. Out-of-the-box MLX inference is reportedly slow, at only 5-10 tokens per second. Real-time performance for applications such as coding assistance requires targeted optimization of memory bandwidth and GPU utilization, which highlights the need for hardware-specific software tuning.
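The role of memory bandwidth can be made concrete with a back-of-the-envelope bound. The sketch below assumes token generation is purely bandwidth-bound, meaning every model weight is read from memory once per generated token, and it ignores the KV cache, activations, and compute time. The function name and the 100 GB/s figure are illustrative placeholders, not measured values for any particular chip.

```python
def decode_ceiling_tokens_per_s(params_billion: float,
                                bits_per_weight: float,
                                bandwidth_gb_s: float) -> float:
    """Rough upper bound on generation speed for a bandwidth-bound decoder.

    Assumption: each generated token requires streaming all weights from
    memory exactly once, so tokens/s <= bandwidth / model size in bytes.
    """
    model_bytes = params_billion * 1e9 * bits_per_weight / 8
    return (bandwidth_gb_s * 1e9) / model_bytes


if __name__ == "__main__":
    # Hypothetical 100 GB/s memory system; substitute the real figure.
    for label, bits in [("4-bit", 4), ("FP16", 16)]:
        ceiling = decode_ceiling_tokens_per_s(7, bits, 100)
        print(f"7B {label}: at most ~{ceiling:.1f} tokens/s")
```

Under these assumptions, a stack that lands far below the ceiling for its hardware is probably losing time outside of memory traffic, while one close to it can mainly be sped up by shrinking the weights, for example through quantization.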
- Apple's MLX framework is specifically designed to leverage the Unified Memory Architecture (UMA) of M-series chips, which allows the CPU and GPU to share the same memory pool. This eliminates the need to copy data between separate system RAM and GPU VRAM, reducing latency and making it feasible to run larger models than typically fit into the VRAM of consumer-grade discrete GPUs.
- In direct comparisons running a Llama-2 7B model with 4-bit quantization, the C++-based `llama.cpp` framework can outperform MLX, achieving around 61 tokens/second for generation, while MLX achieves approximately 31 tokens/second. However, for non-quantized (FP16) models, MLX's performance is much closer to that of `llama.cpp`, indicating trade-offs between Python framework flexibility and optimized C++ performance.
- The performance jump between Apple Silicon generations is significant for AI workloads; benchmarks show the M2 Max is nearly five times faster in BERT-base model inference than the M1. More recently, the M5 chip's GPU neural accelerators provide a speedup of 3x or more in time-to-first-token compared to the M4 for various language models running on MLX.
- While Apple Silicon is closing the gap, high-end NVIDIA GPUs remain the leader in raw performance for many operations. For example, in a BERT-base inference test, an NVIDIA A10 GPU averaged 23.46 milliseconds, compared to 38.23 milliseconds on an M2 Max. The value proposition for Apple devices is local development and inference with high power efficiency, not replacing dedicated data center hardware.
- The broader AI chip market is experiencing a "supercycle," with AI-focused semiconductor companies seeing valuations surge while traditional sectors decline. NVIDIA's data center revenue hit $30.8 billion in fiscal Q3 2025, a 112% year-over-year increase, with that growth tied to demand for High Bandwidth Memory (HBM) and advanced packaging such as TSMC's CoWoS.
- For enterprise ML teams, MLOps tooling is a key consideration for managing inference costs. Open-source tools like MLflow and DVC are used for experiment tracking and data versioning to ensure reproducibility and reduce wasted compute cycles. Cloud platforms like AWS SageMaker provide more integrated solutions for the entire model lifecycle, from training to deployment and monitoring.
- AI is being heavily integrated into the GTM toolchain, with sales enablement platforms like Highspot and Salesforce Einstein using it for predictive analytics, content recommendations, and analyzing sales calls to provide coaching insights. Companies are seeing revenue uplifts of up to 15% by investing in AI-enabled sales and marketing tools.
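The local-inference figures quoted above mix two different metrics: generation speed in tokens/second and time-to-first-token. A minimal sketch of how both could be measured from any streaming backend follows. It uses only the Python standard library; `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` is a stand-in that simulates prefill and per-token delays, not a real MLX or `llama.cpp` call.

```python
import time
from typing import Iterable, Iterator, Optional


def measure_stream(tokens: Iterable[str]) -> dict:
    """Consume a token stream and report count, time-to-first-token,
    and decode rate (tokens after the first, per second)."""
    start = time.perf_counter()
    first_at: Optional[float] = None
    count = 0
    for _ in tokens:
        if first_at is None:
            first_at = time.perf_counter()  # first token arrived
        count += 1
    end = time.perf_counter()

    ttft = None if first_at is None else first_at - start
    decode_seconds = 0.0 if first_at is None else end - first_at
    rate = None
    if count > 1 and decode_seconds > 0:
        # Exclude the first token so prefill time does not skew the rate.
        rate = (count - 1) / decode_seconds
    return {"tokens": count, "ttft_s": ttft, "decode_tok_per_s": rate}


def fake_stream(n_tokens: int, prefill_s: float,
                per_token_s: float) -> Iterator[str]:
    """Hypothetical backend: sleeps to imitate prefill, then decoding."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)
        yield f"tok{i}"


if __name__ == "__main__":
    print(measure_stream(fake_stream(20, prefill_s=0.05, per_token_s=0.01)))
```

To compare stacks, the same function could wrap each backend's streaming generator with an identical prompt and token budget. Separating the two metrics matters because a change that improves time-to-first-token may leave the steady-state generation rate unchanged.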