Apple Silicon ML tweak wins
Developers report that running the Apple Neural Engine (ANE) and GPU in parallel on an M4 Mac mini yields roughly a 1.14x inference speedup on Qwen‑14B, but power draw varies widely by compute unit, underscoring the optimization work still needed. These early results highlight practical tradeoffs for on‑device model placement and power budgeting. (x.com)
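The speedup-versus-power tradeoff can be made concrete with simple arithmetic: energy per token is average power divided by throughput, so a hybrid path saves energy only if its power draw rises by less than its speedup. The sketch below is illustrative only. The baseline throughput and wattage are hypothetical placeholders, not measurements from the cited posts; only the 1.14x speedup comes from the report above.

```python
def energy_per_token(power_watts: float, tokens_per_second: float) -> float:
    """Joules spent per generated token at a steady power draw."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return power_watts / tokens_per_second


def break_even_power(baseline_watts: float, speedup: float) -> float:
    """Highest hybrid power draw that still matches baseline energy per token."""
    return baseline_watts * speedup


# Hypothetical placeholders (NOT measured values from the cited sources).
BASELINE_TOKENS_PER_SECOND = 10.0
BASELINE_WATTS = 30.0
# Reported hybrid ANE+GPU speedup on Qwen-14B.
SPEEDUP = 1.14

hybrid_tokens_per_second = BASELINE_TOKENS_PER_SECOND * SPEEDUP
limit_watts = break_even_power(BASELINE_WATTS, SPEEDUP)

print(f"baseline energy/token: "
      f"{energy_per_token(BASELINE_WATTS, BASELINE_TOKENS_PER_SECOND):.2f} J")
print(f"hybrid throughput: {hybrid_tokens_per_second:.2f} tok/s")
print(f"hybrid break-even power: {limit_watts:.2f} W")
```

Under these placeholder numbers, the hybrid path is more energy efficient per token only while its sustained draw stays under the printed break-even wattage, which is why per-unit power measurements matter as much as the headline speedup.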
The developers behind the recent M4 experiments have surfaced open-source projects that implement hybrid ANE+Metal/MLX inference paths alongside reverse‑engineered ANE hooks for transformer workloads (ane‑infer; maderix/ANE). (github.com) The maderix/ANE project publishes microbenchmarks for the ANE on M4 hardware, reporting training‑path measurements of ≈9.3 ms per step and a sustained 1.78 TFLOPS, and documents its private‑API plumbing and kernel work. (rits.shanghai.nyu.edu) Community MLX/Metal work on Qwen variants reports large throughput gains: MLX‑compiled builds and MLX quantization workflows have been shown to deliver ~21–87% higher throughput than baseline Apple‑Metal backends, with up to ~2× speedups on some Qwen builds. (dev.to)
Independent M4 system profiling shows that Mac mini M4 platform power under real workloads often sits below ~50 W in published reviewer tests, while hybrid‑dispatch experiments and modeling emphasize that the ANE and GPU have distinct sustained power and thermal envelopes that change system‑level peak draw. (beebom.com)
Repository commits and benchmarks from these projects document the practical engineering levers used in the hybrid approach: KV‑cache placement in unified memory, selective layer offload to the ANE, and INT8/W8A8 quantize–dequantize tricks that the authors report can raise ANE throughput (examples show ~1.88× ANE throughput with W8A8 pipelines). (github.com) Systematic profiling work on Apple Silicon calls out unified‑memory bandwidth and kernel launch overhead as limiting factors for aggregate throughput, and community model‑quantization uploads show concrete memory reductions (e.g., MLX mxfp4 quantizations that shrink 14B footprints from ~30 GB to ~10–11 GB), reinforcing why layer‑level placement and memory budgeting matter for M4 Mac mini deployments. (arxiv.org)
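As a rough sanity check on the footprint figures quoted above, weight storage can be estimated as parameter count times bits per weight. This is a back-of-the-envelope sketch, not the method used by any of the cited projects. It assumes a nominal 14 billion parameters and counts weights only, so it ignores the KV cache, activations, and quantization metadata.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    if num_params < 0 or bits_per_weight <= 0:
        raise ValueError("num_params must be >= 0 and bits_per_weight > 0")
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: nominal "14B" parameter count; real checkpoints may differ.
NUM_PARAMS = 14e9

# Bit widths chosen for illustration; per-block scale overhead is not modeled.
PRECISIONS = [
    ("16-bit weights", 16),
    ("8-bit weights", 8),
    ("4-bit weights", 4),
]

for label, bits in PRECISIONS:
    print(f"{label}: ~{weight_footprint_gb(NUM_PARAMS, bits):.1f} GB")
```

The raw estimates come out near 28 GB at 16 bits and 7 GB at 4 bits, somewhat below the published ~30 GB and ~10–11 GB figures. Plausible reasons for the gap include stored quantization scales, layers kept at higher precision, and a true parameter count above the nominal value, though the source does not say which applies.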