
Qwen3.5 runs on iPhone Neural Engine at ~48 tok/s for 0.8B model

- Developers showed Qwen3.5 small models running directly on the iPhone Neural Engine this week, with the 0.8B variant decoding at around 48 tokens per second.
- The 2B model is eye-catching too: roughly 27 tokens per second while staying fully on the Apple Neural Engine.
- That matters because tiny open models are crossing from “possible on phones” into genuinely usable local assistants.

Phone LLMs have mostly lived in the demo zone. They could run, but not fast enough to feel natural. That is why these Qwen3.5 iPhone tests matter. Developers are showing Alibaba’s Qwen3.5 small models decoding on the iPhone Neural Engine at speeds that look usable, not just technically cute. The reported numbers are about 48 tok/s for Qwen3.5-0.8B and about 27 tok/s for Qwen3.5-2B, with the models staying on the Apple Neural Engine rather than bouncing work back to CPU or GPU. (huggingface.co)

### What actually ran on the phone?

The models are Qwen3.5’s small variants, 0.8B and 2B, from the broader Qwen3.5 family that also includes 4B, 9B, 27B, 35B-A3B, 122B-A10B, and 397B-A17B releases. The important part is not just the brand name. It is that these are genuinely tiny modern models, small enough to fit on mobile hardware while still being recent enough to matter for real apps. (huggingface.co)

### Why is 48 tok/s a big deal?

Because once a model gets into that range, chat stops feeling sluggish. You are no longer waiting on every sentence like it is coming through a straw. For a 0.8B model, 48 tok/s is fast enough for snappy UI, autocomplete-style interactions, and simple agent loops. Even 27 tok/s for a 2B model is solid on a phone [...]. The catch is that tokens per second is not intelligence. A 0.8B model is still a small model. But speed is what turns a local model from novelty into product surface. (machinelearning.apple.com)

### Why does the Neural Engine matter?

Apple’s Core ML stack is built to run models across the CPU, GPU, and Neural Engine in a way that minimizes latency, memory footprint, and power draw. Apple is also openly pitching Core ML as a path for running advanced generative models on-device, with model compression [...] phrase here: it means the phone is using the hardware block designed for this kind of inference, which is usually where you get the best mix of speed and battery behavior.
(developer.apple.com)

### Why Qwen3.5 specifically?

Qwen3.5 is unusually well positioned for this moment. The family includes very small open-weight models, and Apple’s MLX tooling for Apple Silicon already supports Qwen3.5. That does not prove the iPhone demo used MLX end to end, since MLX is mainly the Mac-side story, but it does show [...] model releases only become real products once somebody can quantize, convert, and ship them. (github.com)

### Does this mean phones can replace cloud AI?

Not really. Small models still hit a ceiling on reasoning depth, coding reliability, and long multi-step tasks. Apple’s own on-device foundation model is only around 3B parameters, and even Apple pairs that with larger server-side models when the job needs more horsepower. The split is becoming clearer: p[...] cloud handles the hard stuff. (machinelearning.apple.com)

### So what changed?

The change is not that on-device AI exists. It is that the speed-quality tradeoff is getting good enough on mainstream hardware to support real software. A 2B-or-smaller open model running locally on an iPhone at conversational speed means offline assistants, private summarizers, and app-n[...]. But once a phone can run a current small model at 27 to 48 tok/s on dedicated silicon, local AI stops looking like a fallback and starts looking like the default.
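
To put the reported decode rates in perspective, here is a rough back-of-the-envelope sketch in Python. It only divides a reply length by the tok/s figures quoted above. The reply lengths (50 and 200 tokens) and the function name `decode_seconds` are illustrative assumptions, and the estimate ignores prompt processing (prefill) and any thermal throttling, so treat it as an approximation rather than a benchmark.

```python
# Decode rates as reported in the article (tokens per second).
REPORTED_TOK_PER_S = {"Qwen3.5-0.8B": 48.0, "Qwen3.5-2B": 27.0}


def decode_seconds(num_tokens: int, tok_per_s: float) -> float:
    """Time to stream num_tokens at a steady decode rate (prefill ignored)."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return num_tokens / tok_per_s


if __name__ == "__main__":
    # Hypothetical reply lengths, chosen only for illustration.
    for label, n_tokens in [("short reply", 50), ("paragraph", 200)]:
        for model, rate in REPORTED_TOK_PER_S.items():
            secs = decode_seconds(n_tokens, rate)
            print(f"{model}: {label} ({n_tokens} tokens) ~ {secs:.1f} s")
```

Under these assumptions, a 200-token paragraph streams in roughly 4.2 seconds at 48 tok/s and roughly 7.4 seconds at 27 tok/s.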

