On‑device limits spelled out

- Analysts and podcasts listed practical on‑device constraints: memory footprint, battery, thermal limits and quantization quality. (x.com)
- They also argued distribution and platform integration often matter more than headline model size for real user experiences. (x.com)
- The advice for builders is concrete: focus on compression, caching and UX to make local inference reliable on phones. (x.com)

Running an AI model on a phone is less about the headline parameter count than whether the device can keep the model in memory, feed it fast enough, and avoid overheating. (machinelearning.apple.com) On-device AI means the model runs locally instead of sending prompts to a remote server. Apple said its current on-device Apple Intelligence model is about 3 billion parameters and uses tricks including KV-cache sharing and 2-bit quantization-aware training to fit and run on Apple silicon. (machinelearning.apple.com)

Google is making the same pitch on Android, where Gemini Nano runs through the AICore system service. Google said AICore manages model updates and safety, while local execution can cut network delay and keep data on the device, but speed still depends on the hardware in the phone. (developer.android.com)

The hardware problem is straightforward: phones have tight power and heat budgets. Qualcomm said generative AI needs a mix of Neural Processing Unit, central processing unit, and graphics processing unit resources, and that choosing the right processor mix is what preserves thermal efficiency and battery life during inference. (qualcomm.com)

Memory is a second bottleneck, because a model has to fit not just its weights but also temporary working data while it generates tokens. Qualcomm said newer on-device techniques aim to reduce memory bandwidth as well as storage, and Apple highlighted cache-sharing and low-bit training for the same reason. (qualcomm.com) (machinelearning.apple.com)

That is why compression and quantization keep coming up. Qualcomm’s materials describe the target as shrinking models from 16-bit to 4-bit-style representations to cut size and latency, while warning that aggressive post-training quantization can damage accuracy if it is not handled carefully. (qualcomm.com)

The software layer can matter as much as the model. Apple’s Foundation Models framework gives developers access to the on-device model with built-in guided generation, tool calling, streaming, and session management, while Google is steering Android developers toward higher-level ML Kit GenAI APIs on top of AICore. (developer.apple.com) (developer.android.com)

That shifts the contest from “who has the biggest model” to “who can ship the most reliable feature.” Apple said developers can plug model features into apps and system surfaces through App Intents, and Google said developers should choose between on-device and cloud tools based on task complexity, input size, privacy needs, and offline requirements. (developer.apple.com) (developer.android.com)

In practice, the winning phone feature is often the one that starts quickly, streams partial results, survives weak connectivity, and does not drain 10% of the battery in a few minutes. The companies building the platform stacks are now exposing exactly those controls, which is why builders keep talking about compression, caching, and user interface design instead of raw model size. (developer.apple.com) (developer.android.com)
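To make the memory argument concrete, here is a rough back-of-the-envelope sketch of why bit width matters for a model of about 3 billion parameters, the size Apple cites. It is an illustration under stated assumptions, not any vendor's actual accounting: the function names are made up for this example, the weight figures ignore per-group scales and any layers kept at higher precision, and the layer count, head count, head dimension, and context length used for the working-data estimate are hypothetical values, not published specifications.

```python
def weight_bytes(num_params: int, bits_per_weight: float) -> float:
    """Lower-bound storage for the weights alone, in bytes."""
    return num_params * bits_per_weight / 8


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Temporary key/value cache for one sequence, in bytes.

    The leading factor of 2 covers keys and values; bytes_per_value=2
    assumes 16-bit cache entries.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


PARAMS = 3_000_000_000  # "about 3 billion parameters", per the article

for bits in (16, 4, 2):
    gb = weight_bytes(PARAMS, bits) / 1e9
    print(f"{bits:>2}-bit weights: {gb:.2f} GB")

# Hypothetical architecture numbers, chosen only to show the scale
# of the working data that sits alongside the weights.
cache = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096)
print(f"KV cache at 4096 tokens: {cache / 1e9:.2f} GB")
```

Under these assumptions the weights alone shrink from 6 GB at 16-bit to 1.5 GB at 4-bit and 0.75 GB at 2-bit, while the token cache adds roughly another half gigabyte, which helps explain why both low-bit training and cache-sharing show up in the vendors' descriptions.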
