Gemma 4 runs on iPhone
Developers report that CoreML-LLM v0.5.0 can run Google's Gemma 4 E2B on an iPhone's Neural Engine at roughly 31 tokens per second of decode, with 99.78% of language-model operations reportedly placed on the Neural Engine. (x.com)
A language model predicts the next chunk of text, one token at a time, and developers now say Google’s Gemma 4 E2B can do that on an iPhone’s Neural Engine. (github.com) (developer.apple.com)

The reported setup uses CoreML-LLM version 0.5.0, an open-source project by developer Daisuke Majima, to run Gemma 4 E2B through Apple’s Core ML stack on an iPhone 17 Pro. The project README lists about 31 tokens per second for decode and about 154 tokens per second for prefill. (github.com 1) (github.com 2)

CoreML-LLM says it sends the model to the Apple Neural Engine, the dedicated artificial intelligence block in Apple chips, instead of leaning mainly on the graphics processor. The README reports 99.78 percent Neural Engine placement, measured as 7,294 of 7,310 dispatched language-model operations on the Neural Engine. (github.com 1) (github.com 2)

Google released Gemma 4 on March 31, 2026, with E2B and E4B edge models aimed at phones, browsers, and other small devices. Google’s documentation says those smaller models support text and image input, and the E2B and E4B variants also feature native audio and video support. (ai.google.dev 1) (ai.google.dev 2)

That hardware target matters because Core ML is Apple’s system for running machine-learning models on a device’s central processor, graphics processor, and Neural Engine without a server call. Apple says on-device execution can cut network dependence and keep data on the device. (developer.apple.com)

The CoreML-LLM project frames that tradeoff directly: developers who want maximum throughput may still prefer Apple’s graphics path, while developers who want the model to live on the Neural Engine can keep the graphics processor free. The sample app supports Gemma 4 text chat and image understanding on iOS 18 or later devices. (github.com)

The model is not tiny. CoreML-LLM lists a pre-converted Gemma 4 E2B package at 2.7 gigabytes, and the README says the app’s physical memory footprint is about 1 gigabyte during inference on the tested phone. (github.com) The project also notes that the vision encoder still runs on the graphics processor by design, so the demo is not a claim that every part of a multimodal model sits on the Neural Engine. Apple has published separate guidance for adapting transformer-style models to the Neural Engine, with examples dating back to A14 and M1-class chips. (github.com)

For now, the result is a developer benchmark, not an Apple product announcement or a Google shipping app. But it shows that a model family Google released less than two weeks ago is already being converted to run locally on an iPhone, with most language-model operations reportedly landing on Apple’s Neural Engine. (ai.google.dev) (github.com)
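
For readers who want to check the README figures quoted in this article, the arithmetic can be sketched in a few lines of Python. The operation counts and token rates are the ones reported above. The function names, the 512-token prompt, and the 128-token reply are hypothetical values chosen for illustration, and the latency estimate assumes both rates stay constant, which ignores model load time and thermal throttling.

```python
# Figures as quoted from the CoreML-LLM README in this article.
ANE_OPS = 7_294            # language-model operations dispatched to the Neural Engine
TOTAL_OPS = 7_310          # all dispatched language-model operations
PREFILL_TOK_PER_S = 154.0  # reported prompt-processing rate
DECODE_TOK_PER_S = 31.0    # reported generation rate


def placement_percent(on_ane: int, total: int) -> float:
    """Share of operations placed on the Neural Engine, in percent."""
    return 100.0 * on_ane / total


def estimated_seconds(prompt_tokens: int, output_tokens: int,
                      prefill_rate: float = PREFILL_TOK_PER_S,
                      decode_rate: float = DECODE_TOK_PER_S) -> float:
    """Rough response time: prefill the prompt, then decode the reply.

    Assumption (not from the README): both rates hold steady for the
    whole request.
    """
    return prompt_tokens / prefill_rate + output_tokens / decode_rate


if __name__ == "__main__":
    # Prints 99.78, matching the README's placement figure.
    print(f"Neural Engine placement: {placement_percent(ANE_OPS, TOTAL_OPS):.2f}%")
    print(f"Operations off the Neural Engine: {TOTAL_OPS - ANE_OPS}")
    # Hypothetical request: 512-token prompt, 128-token reply.
    print(f"Estimated response time: {estimated_seconds(512, 128):.1f} s")
```

Under those assumptions, 16 of the 7,310 operations run somewhere other than the Neural Engine, and the hypothetical request would take about 7.5 seconds from start to finish, with a little over half of that spent on decode.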