Snapdragon, Tensor keep data local
- Google and Qualcomm now have official paths for running generative models directly on phones and PCs, using Gemini Nano and Hexagon NPUs without cloud calls.
- The practical detail is the tooling: Android’s AICore and ML Kit Prompt API on one side, Qualcomm AI Hub and Genie with 4-bit quantization on the other.
- That shifts the mobile AI tradeoff: less latency, lower serving cost, better privacy, but smaller models and tighter device-memory limits.

Phone AI is starting to split into two very different products. One lives in the cloud and feels bigger, smarter, and more expensive. The other runs on the device itself and feels faster, more private, and much cheaper to serve. What changed is that both Google and Qualcomm now have real developer plumbing for the second path: not just demos, but shipping APIs and toolchains that let apps run generative models locally on Gemini Nano or Snapdragon NPUs.

### What does “on-device” actually mean?

It means the prompt, the inference step, and the output stay on the phone or PC instead of being sent to a remote server. Google’s current Android stack does this through AICore, the system service that hosts Gemini Nano, while Qualcomm exposes local execution through its AI runtimes and Hexagon NPU tooling. Basically, the model is near the user, not in a data center. (developer.android.com)

### Why are Snapdragon and Tensor the chips people keep naming?

Because those chips have dedicated neural hardware built for this exact job. Google’s Tensor devices use AICore to tap device hardware for low-latency Gemini Nano inference, and Qualcomm’s stack can target CPU, GPU, or the Hexagon NPU, with the NPU being the efficient path for sustained local AI. That hardware piece matters: a general CPU can run a model, but usually not with the same speed-per-watt. (developer.android.com)

### What is new on Google’s side?

Google has moved beyond a narrow set of canned features. ML Kit already offered on-device summarization, rewriting, proofreading, image description, and speech recognition with Gemini Nano. Then Google added the ML Kit Prompt API in alpha, which gives developers a more open-ended way to send custom text and multimodal prompts to Gemini Nano locally. That is the jump from “use Google’s preset feature” to “build your own app behavior.” (developer.android.com)

### What is new on Qualcomm’s side?

Qualcomm’s pitch is the full deployment pipeline.
AI Hub can take a model, compile it, quantize it, validate it on Qualcomm hardware, and export binaries for local runtime. Its Genie workflow is explicitly aimed at LLM deployment, including splitting big models into components and using 4-bit quantization for runtime efficiency. In plain English, Qualcomm is trying to make “bring your own local model” less painful. (developers.google.com)

### Why does privacy improve?

Because fewer things leave the device. Google is unusually explicit here: on-device execution removes server calls, keeps sensitive data local, and AICore is isolated with restricted package access and no direct internet access. That does not make every app magically safe, because the app developer still controls what gets collected or uploaded elsewhere, but the model step itself can stay local. (aihub.qualcomm.com)

### Why do latency and cost improve?

The obvious win is no round trip to the cloud. Responses can start faster, and the app maker is not paying per request for model inference on a server. Google says on-device GenAI avoids additional server cost per API call, and Qualcomm’s whole stack is built around local execution on device hardware. For features like rewriting text, classifying a photo, or summarizing a note, that can be the difference between a neat demo and a feature you can afford to leave on. (developer.android.com)

### So what’s the catch?

Small local models are still small local models. Qualcomm’s own deployment guide gives a rough sense of the constraints (about 12 GB RAM for 3B models and 16 GB for 7B models on validated setups), which tells you why quantization and model splitting matter so much. Google also notes that performance depends on device hardware, and its newest Prompt API currently performs best on the Pixel 10 series. So yes, local AI is real, but it is still bounded by memory, thermals, and the specific chip in the user’s hand. (developers.google.com)

### Where does this leave product teams?

It makes the decision sharper. If a feature needs the biggest possible model or broad world knowledge, cloud inference still wins. But if the job is narrow, frequent, privacy-sensitive, or latency-sensitive, local inference is now a practical default. The bottom line is that Snapdragon and Tensor are turning “keep the data on the device” from a nice slogan into a product architecture choice. (developer.android.com) (docs.qualcomm.com)
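
As a rough illustration of that decision, here is a minimal Python sketch of a routing heuristic. Everything in it is an assumption made for illustration: the `Feature` fields, the `choose_backend` function, and the 20-calls-per-day threshold are hypothetical and come from neither Google's nor Qualcomm's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical description of one AI feature (not a real SDK type)."""
    name: str
    needs_broad_knowledge: bool   # wants the biggest model or wide world knowledge
    privacy_sensitive: bool       # input should stay on the device
    latency_sensitive: bool       # user is waiting on the response
    calls_per_user_per_day: int   # rough usage frequency


def choose_backend(feature: Feature, frequent_threshold: int = 20) -> str:
    """Return "cloud", "on_device", or "either" for a feature.

    The threshold is an illustrative assumption, not a vendor recommendation.
    """
    # Jobs that need the largest models still belong in the cloud.
    if feature.needs_broad_knowledge:
        return "cloud"
    # Narrow jobs with any local-friendly trait default to the device.
    if (
        feature.privacy_sensitive
        or feature.latency_sensitive
        or feature.calls_per_user_per_day >= frequent_threshold
    ):
        return "on_device"
    # No strong signal either way: decide on cost and device coverage.
    return "either"


if __name__ == "__main__":
    for f in (
        Feature("summarize_note", False, True, True, 30),
        Feature("open_ended_research", True, False, False, 2),
    ):
        print(f.name, "->", choose_backend(f))
```

A real app would likely pair a rule like this with a runtime check that the device can actually host the model, since both vendors tie performance to specific hardware.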