Custom ML runtime details
A deep dive describes a custom ML runtime for Apple Silicon that combines MLX, private ANE APIs, and heterogeneous scheduling to run LLMs such as GPT‑2 efficiently on device. The writeup connects low‑level hardware tricks with practical deployment strategies for local inference. (x.com)
AtomGradient’s open benchmark compares three approaches on Qwen‑3.5: MLX only, Core ML (ANE) prefill with MLX decode, and fully ANE. It documents the crossover points and four concrete inference pipelines. (atomgradient.com) The measurements show that ANE prefill “matches GPU” performance at roughly 410 input tokens and report a ~282× reduction in GPU power draw during the prefill stage. (atomgradient.com) A concurrent AtomGradient experiment (ANE batch prefill) achieved an 11.3× batch dispatch speedup (268 tok/s) and recorded a 27 ms time‑to‑first‑token (TTFT) on multi‑turn conversations when running a fused ANE pipeline. (atomgradient.com) For larger models the hybrid approach can lose: in their 9B model runs the hybrid ANE+GPU pipeline was slower than a GPU‑only MLX baseline, which they attribute to 4‑chunk Core ML dispatch overhead and an FP16→8‑bit KV cache bridge that reduced decode throughput by ~11–16%. (github.com)
Independent projects expose the ANE beyond Core ML: the maderix/ANE repo demonstrates reverse‑engineered private ANE kernels and reports transformer backpropagation benchmarks of 9.3 ms/step and ~1.78 TFLOPS sustained on an M4 for dim=768, seq=512 workloads. (aibit.im)
MLX is Apple’s array framework designed for unified‑memory CPU/GPU execution on Apple Silicon, but it currently lacks native ANE execution, so deployers must either bridge to Core ML stateful models or use private APIs to bring the ANE into the pipeline. (mlx-framework.org) Practical constraints flagged in the writeups include Core ML’s fixed‑shape requirement, reliance on the stateful‑model features of macOS 15/iOS 18 for the KV cache, and the need to pack all per‑layer KV state into a single concatenated tensor to avoid ANE compile failures. (blog.squeezebits.com)
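To make the last two constraints concrete, here is a minimal sketch in plain Python of what fixed-shape padding and a packed KV layout might look like. It is not taken from any of the cited writeups. The names `PackedKVLayout` and `pad_to_fixed_length`, the interleaved key/value ordering along the first axis, and the right-padding scheme are all assumptions chosen for illustration. Only index arithmetic is shown, with no Core ML or MLX calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PackedKVLayout:
    """Hypothetical layout: every layer's K and V share one state tensor.

    Assumed shape: (2 * num_layers, num_kv_heads, max_seq_len, head_dim),
    with layer i's keys at slot 2*i and its values at slot 2*i + 1.
    """
    num_layers: int
    num_kv_heads: int
    max_seq_len: int
    head_dim: int

    @property
    def shape(self):
        # A single concatenated tensor instead of 2 * num_layers separate states.
        return (2 * self.num_layers, self.num_kv_heads,
                self.max_seq_len, self.head_dim)

    def _check(self, layer):
        if not 0 <= layer < self.num_layers:
            raise IndexError(f"layer {layer} out of range")

    def key_index(self, layer):
        """First-axis slot holding the keys of `layer`."""
        self._check(layer)
        return 2 * layer

    def value_index(self, layer):
        """First-axis slot holding the values of `layer`."""
        self._check(layer)
        return 2 * layer + 1


def pad_to_fixed_length(token_ids, fixed_len, pad_id=0):
    """Right-pad a prompt to the fixed length a fixed-shape model expects.

    Returns (padded_ids, mask), where mask is 1 for real tokens, 0 for padding.
    """
    if len(token_ids) > fixed_len:
        raise ValueError("prompt is longer than the fixed input length")
    n_pad = fixed_len - len(token_ids)
    padded = list(token_ids) + [pad_id] * n_pad
    mask = [1] * len(token_ids) + [0] * n_pad
    return padded, mask
```

In a real pipeline the packed shape would be declared once as the model's single state tensor, and each attention layer would read and write only its own two slots. The exact axis order and padding side depend on the converted model, so treat these choices as placeholders.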