On‑device ML speedups reported
CoreML-LLM v0.8.0 ships Gemma 4 E4B (a 4B effective-parameter decoder), which ran at roughly 14 tokens per second on an iPhone 17 Pro’s Neural Engine in a demo that showed full Neural Engine utilization. (x.com) Separately, Apple’s M4 Neural Engine is reported to be running faster in production apps such as Draw Things, a sign that on-device acceleration is improving in the field. (x.com)
A phone’s Neural Engine is a dedicated block of the processor for machine-learning math, and new demos suggest Apple devices are getting faster at running those models locally. (github.com, apple.com) Running a model locally means the prompt, weights, and generated text stay on the device instead of being sent to a server. Apple has long pushed that setup with the Apple Neural Engine, or ANE, which it says is built to accelerate artificial-intelligence workloads. (github.com, apple.com)
CoreML-LLM, an open-source project for Apple hardware, says its latest work targets the Neural Engine rather than the graphics processor, or GPU, so the GPU can stay free for other tasks. Its README says Gemma 4 E2B reaches about 31 tokens per second on an iPhone 17 Pro at a 2,048-token context length, with 99.78% of dispatched large-language-model operations placed on the ANE. (github.com)
Google released Gemma 4 on March 31, 2026, including E2B and E4B “effective parameter” models aimed at smaller hardware. Google’s model overview says those 2B and 4B effective-parameter versions are built for ultra-mobile, edge, and browser deployment. (ai.google.dev)
That matters because local AI on phones and tablets has usually meant tighter limits on speed, memory, or battery life than cloud systems. CoreML-LLM says its iPhone 17 Pro setup uses about 1 GB of physical memory during inference and keeps roughly 5 GB available on an 8 GB device. (github.com)
Apple has also been widening the hardware budget for this kind of work. When it introduced M4 on May 7, 2024, Apple said the chip’s 16-core Neural Engine could deliver up to 38 trillion operations per second and called it its fastest Neural Engine at the time. (apple.com)
The field evidence is starting to show up in shipping apps, not just lab demos. Draw Things, an offline image-generation app for iPhone, iPad, and Mac, added “Apple Neural Engine support for M4” in its April 11, 2026 release notes. (drawthings.ai)
Draw Things’ release notes also point to a pattern across Apple’s recent chips: a March 24, 2026 update said it improved M5 performance by 2% to 10% through heavier use of “Neural Accelerators,” while a November 18, 2025 update said it had to mitigate performance regressions on macOS 26 for M1 through M4 devices. (drawthings.ai)
Apple’s own transformer reference code, published for A14 and newer iPhones and M1 and newer Macs, said ANE-optimized deployments could run up to 10 times faster and use up to 14 times less peak memory than baseline implementations in its case study. Those figures are older and model-specific, but they help explain why developers keep chasing fuller Neural Engine utilization. (github.com)
The immediate story is not that every model now runs well on a phone. It is that more of the work is moving onto Apple’s dedicated machine-learning hardware, and developers are beginning to show that shift in public demos and production release notes. (github.com, drawthings.ai, apple.com)
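To put the reported token rates in perspective, a rough back-of-the-envelope calculation can translate them into waiting time for a reply. The sketch below is illustrative only: the 300-token reply length is a hypothetical figure not taken from any of the sources, the function name is made up for this example, and the calculation assumes a steady decode rate while ignoring the time needed to process the prompt.

```python
# Decode rates as reported in the article (tokens per second).
REPORTED_RATES = {
    "Gemma 4 E2B (README, 2,048-token context)": 31.0,
    "Gemma 4 E4B (demo)": 14.0,
}

# Hypothetical reply length, chosen only for illustration.
REPLY_TOKENS = 300


def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to decode num_tokens at a steady rate, ignoring prompt processing."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


for name, rate in REPORTED_RATES.items():
    secs = seconds_to_generate(REPLY_TOKENS, rate)
    print(f"{name}: about {secs:.1f} s for {REPLY_TOKENS} tokens")
```

Under those assumptions, a 300-token reply would take roughly 10 seconds at the reported E2B rate and roughly 21 seconds at the reported E4B rate. Real-world timing would also depend on prompt length, thermal state, and context size, none of which this sketch models.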