Offline Gemma 4 lands on phones
A demo shows Google's Gemma 4 running fully offline on phones, enabling local AI inference without an internet connection. Running models locally changes device engineering considerations—privacy, latency and battery life—and opens new project ideas for mobile-first or edge applications. The demo underscores how edge execution is becoming realistic for phone apps and agent workflows. (x.com)
A phone used to treat artificial intelligence like a walkie-talkie. You spoke into an app, the app sent your words to a distant data center, and the answer came back a moment later. Now Google is showing a different setup: Gemma 4 running directly on phones, fully offline, with no internet connection in the loop. (blog.google)

That shift sounds small until you picture where the work happens. In the old model, your phone was mostly a messenger. In the new one, the phone is the workshop. The model sits on the device, uses the device’s own memory and chips, and produces the answer locally. (deepmind.google)

A language model is just a prediction engine that has learned patterns from enormous amounts of text, images, and sometimes audio. When you ask it a question, it guesses the next word, then the next one, over and over, fast enough to feel like conversation. (ai.google.dev)

Running that engine on a phone is hard for the same reason editing a feature film on a pocket camera is hard. Phones have tight limits on memory, heat, and battery. A model that feels effortless on a cloud server can overwhelm a mobile device if it is too large or badly optimized. (developers.googleblog.com)

Memory is the first wall. A model’s weights are the stored numbers that encode what it has learned, and those numbers have to fit into the device’s available memory before the model can respond smoothly. If the model is too big, it runs slowly, drains power, or does not run at all. (blog.google)

Heat is the second wall. When a phone keeps its processor busy for a long stretch, the device warms up and the system starts reducing speed to protect itself, much like a car engine backing off when it gets too hot. That means a model that starts fast can become sluggish if local inference is not carefully tuned. (developers.googleblog.com)

Battery is the third wall. Every locally generated answer spends energy on the user’s own hardware instead of a remote server. That tradeoff can be worth it for privacy or speed, but it forces developers to think more like chip designers and less like web app builders. (android-developers.googleblog.com)

The upside is privacy that is easier to explain in plain English. If the model runs entirely on the device, sensitive prompts, photos, or voice notes do not need to leave the phone just to get processed. That does not solve every security problem, but it removes one very large category of risk: sending raw user data to the cloud for each request. (play.google.com)

The other upside is latency, which is just the delay between asking and getting a response. A local model does not have to wait for a network connection, a server queue, or a round trip across the internet. On a good connection that delay may be small, but on a train, in a basement, or on a plane, it can be the whole experience. (apps.apple.com)

That is the background for this week’s news. Google introduced Gemma 4 on April 2, 2026, describing it as its most capable open model family so far and positioning it for local use across hardware that includes phones. Google says the lineup includes Effective 2 Billion, Effective 4 Billion, 26 Billion Mixture of Experts, and 31 Billion Dense variants. (blog.google)

The “Effective” models are the key to the phone story. They are the smaller members of the family, built for environments where every gigabyte and every watt count. Google’s mobile deployment documentation now lists Gemma 4 for mobile devices and points developers to tools for running it on-device. (ai.google.dev)

The public demo that pushed this story into wider view shows Gemma 4 running offline on a phone through Google’s AI Edge Gallery app. Google’s listings for AI Edge Gallery in the Android and Apple app stores both now say the app features Gemma 4 and runs models directly on the device with offline, private processing. (play.google.com) (apps.apple.com)

AI Edge Gallery is not just a chat window. Google describes it as a place to run open models on mobile hardware, and the store listings mention features such as image questions, audio transcription, and local chat. That matters because it turns offline artificial intelligence from a lab trick into something closer to an app platform. (github.com) (play.google.com)

Google is also connecting Gemma 4 to more agent-like behavior, where a model does not only answer a prompt but helps carry out multi-step tasks. In its Android developer materials, Google says Gemma 4 supports tool use and was designed with agent mode in mind, while keeping inference local. On a phone, that points toward assistants that can reason over personal context without constantly shipping that context to remote servers. (android-developers.googleblog.com)

There is still a ceiling. The biggest Gemma 4 models are meant for much stronger hardware, and even favorable demos do not mean every phone can run every model well. Google’s own materials frame the smaller versions as the mobile path, while the larger variants target workstations and more capable local systems. (blog.google) (deepmind.google)

But the important change is that “offline phone model” is no longer a science project phrase. Google now has official apps, official documentation, and an official model family built around the idea that useful artificial intelligence can live on the device itself. That opens up a different kind of software design, where a travel app can still summarize notes without signal, a field worker can transcribe audio without uploading it, and a mobile assistant can respond instantly because the computer is already in your hand. (ai.google.dev) (github.com)
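
To make the memory wall described earlier concrete, here is a rough back-of-the-envelope sketch in Python. It rests on several assumptions that are not from Google: the parameter counts are simply the nominal sizes in the variant names (the real stored size of the "Effective" models may differ), the bit widths are illustrative quantization levels, and the 6 GiB budget is a hypothetical figure for what an app might get on a phone. It counts the weights only and ignores activations, caches, and runtime overhead, so real requirements would be higher.

```python
# Rough weight-memory estimate: parameters x bits per weight / 8 = bytes.
# ASSUMPTION: parameter counts are the nominal sizes from the variant names,
# not measured figures. Weights only; no activations, cache, or overhead.

NOMINAL_PARAMS = {
    "Effective 2 Billion": 2e9,
    "Effective 4 Billion": 4e9,
    "26 Billion Mixture of Experts": 26e9,
    "31 Billion Dense": 31e9,
}

BYTES_PER_GIB = 2**30


def weight_gib(params: float, bits_per_weight: int) -> float:
    """Return the storage needed for the weights alone, in GiB."""
    return params * bits_per_weight / 8 / BYTES_PER_GIB


def fits(params: float, bits_per_weight: int, budget_gib: float) -> bool:
    """True if the weights alone fit inside the given memory budget."""
    return weight_gib(params, bits_per_weight) <= budget_gib


if __name__ == "__main__":
    budget_gib = 6.0  # HYPOTHETICAL memory budget for a phone app
    for name, params in NOMINAL_PARAMS.items():
        for bits in (16, 8, 4):  # illustrative precision levels
            size = weight_gib(params, bits)
            verdict = "fits" if fits(params, bits, budget_gib) else "too big"
            print(f"{name:32s} {bits:2d}-bit  {size:6.2f} GiB  {verdict}")
```

Under these assumptions, only the two smaller variants come in under the budget, and the 4 Billion one only at reduced precision, which is consistent with the article's point that the smaller models are the mobile path.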