Apple NPU benchmarks: M5 Max claims
Circulating benchmarks report the M5 Max NPU reaching close to its theoretical 40 TOPS on ML workloads, while the M3 Ultra remains strong in TFLOPS for floating-point tasks. Developers can use both figures for on-device performance planning, and the numbers underline why teams are optimizing Core ML and Metal paths for Apple silicon. (x.com)
ANEMLL’s anemll-bench repo lists a cross-generation table in which the M5 registers 70.21 GB/s memory bandwidth and a 6.10 ms inference time on the llama_lm_head test, placing it between M2 and M3 Max in those specific measurements. (github.com/Anemll/anemll-bench/blob/main/Results.MD)

Apple’s announcement for M5 Pro and M5 Max describes the new Fusion multi-die architecture, an 18-core CPU configuration, and a GPU design that scales up to 40 cores with a Neural Accelerator inside each GPU core. (apple.com/newsroom/2026/03/apple-debuts-m5-pro-and-m5-max-to-supercharge-the-most-demanding-pro-workflows/)

Early Geekbench 6 entries show an M5 Max multi-core CPU score of 29,233 compared with an M3 Ultra entry at 27,726, which puts the new silicon ahead in raw CPU throughput in these leaked public results. (macrumors.com/2026/03/05/m5-max-geekbench-benchmarks/)

Public hardware spec aggregators and FP32 calculations place the M3 Ultra GPU’s theoretical single-precision throughput in the high-20 TFLOPS range (roughly 28 TFLOPS FP32 in published tables), which explains why the M3 Ultra still leads in floating-point shader throughput for certain workloads. (cpu-monkey.com/en/igpu-apple_m3_ultra_80_core) (hmc-tech.com/gpu/apple-80-core-m3-ultra)

Apple’s developer docs and Metal 4 “machine learning passes” explicitly enable running Core ML models inside Metal workflows and expose Tensor APIs for per-core neural accelerators. This is the stack vendors reference when choosing between Core ML, MPS, and direct Metal tensor paths. (developer.apple.com/machine-learning/core-ml/) (developer.apple.com/documentation/metal/machine-learning-passes)

Independent roofline analyses measuring the M5 GPU show large gaps between Core ML’s effective GFLOPS and raw Metal compute shader throughput. (michaelstinkerings.org/apple-m5-gpu-roofline-analysis/) Early community MLX/ML benchmarks demonstrate large-model LLM inference running on 128 GB M5 Max configurations (examples include Qwen3.5-122B tests). (hardware-corner.net/m5-max-local-llm-benchmarks-20261233/) Together, these results are driving teams toward Metal/tensor-level optimization for maximum on-device utilization.
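The theoretical FP32 figure and the roofline comparison mentioned above can be sanity-checked with a few lines of arithmetic. The Python sketch below is illustrative only and is not taken from any of the cited sources. It assumes 128 FP32 ALUs per GPU core, one fused multiply-add (2 FLOPs) per ALU per cycle, and a GPU clock of about 1.4 GHz. With those assumptions an 80-core part lands near the roughly 28 TFLOPS quoted in published tables. The 800 GB/s bandwidth and the arithmetic-intensity values are hypothetical placeholders, so substitute measured numbers before drawing conclusions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Two-parameter roofline description of a compute device."""
    name: str
    peak_gflops: float    # theoretical compute ceiling in GFLOP/s
    bandwidth_gbs: float  # sustained memory bandwidth in GB/s


def peak_fp32_gflops(cores: int, alus_per_core: int, clock_ghz: float) -> float:
    """Theoretical FP32 peak, assuming one FMA (2 FLOPs) per ALU per cycle."""
    return cores * alus_per_core * 2 * clock_ghz


def ridge_point(device: Device) -> float:
    """Arithmetic intensity (FLOP/byte) where the kernel becomes compute bound."""
    return device.peak_gflops / device.bandwidth_gbs


def attainable_gflops(device: Device, intensity: float) -> float:
    """Roofline ceiling: min(compute peak, bandwidth * arithmetic intensity)."""
    return min(device.peak_gflops, device.bandwidth_gbs * intensity)


def is_memory_bound(device: Device, intensity: float) -> bool:
    """True when the kernel sits on the bandwidth-limited slope of the roofline."""
    return intensity < ridge_point(device)


def utilization(measured_gflops: float, device: Device, intensity: float) -> float:
    """Fraction of the roofline ceiling that a measured kernel actually reaches."""
    return measured_gflops / attainable_gflops(device, intensity)


if __name__ == "__main__":
    # ASSUMPTIONS: 128 ALUs per core and a 1.4 GHz clock are illustrative inputs.
    peak = peak_fp32_gflops(cores=80, alus_per_core=128, clock_ghz=1.4)
    # ASSUMPTION: 800 GB/s is a placeholder, not a measured bandwidth.
    gpu = Device(name="example-80-core-gpu", peak_gflops=peak, bandwidth_gbs=800.0)

    print(f"theoretical peak: {peak / 1000:.2f} TFLOPS")
    print(f"ridge point: {ridge_point(gpu):.2f} FLOP/byte")
    for intensity in (2.0, 16.0, 100.0):  # hypothetical kernel intensities
        bound = "memory" if is_memory_bound(gpu, intensity) else "compute"
        ceiling = attainable_gflops(gpu, intensity)
        print(f"intensity {intensity:6.1f}: ceiling {ceiling:9.1f} GFLOP/s ({bound} bound)")
```

A gap between a framework's effective GFLOPS and the ceiling returned by attainable_gflops is what a roofline analysis reports. Passing a measured throughput to utilization gives that gap as a single fraction, which makes it easier to compare a Core ML path against a hand-written Metal kernel at the same arithmetic intensity.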