Gemma 4 at the Edge

Google's Gemma 4 is moving beyond research weights into distribution: Cloudflare Workers AI now offers Gemma 4 26B A4B for edge inference, and Google's AI Edge Gallery app reportedly runs Gemma models locally on Android and iOS for offline text, image and audio tasks. That combination expands where inference can live (central cloud, edge nodes, or on-device) and widens the low-latency and data-sovereignty options available to deployments. (thetechoutlook.com) (moneycontrol.com)

Google’s Gemma 4 was introduced as an open model family. That mattered to researchers. This week it started to matter to builders.

Cloudflare added Gemma 4 26B A4B to Workers AI on April 4, which means developers can now call the model from Cloudflare’s edge network instead of only running it in a central cloud or on their own machines. At almost the same moment, Google pushed Gemma 4 into its AI Edge Gallery app for Android and iOS, where smaller Gemma variants can run directly on a phone with no network connection at all.

The interesting part is not either launch by itself. It is the map they create together. In a matter of days, the same model family now spans the data center, the edge, and the handset.

That changes the practical meaning of “deployment.” For years, most AI systems lived in one place: a remote server farm. Prompts went in. Tokens came back. Gemma 4 breaks that neat arrangement into layers.

Cloudflare’s version sits close to users, on infrastructure designed to cut round-trip delay and keep workloads near the geography where they originate. Google’s phone app pushes the idea further. In AI Edge Gallery, inference happens on the device hardware itself, offline, with prompts and images never leaving the phone. Once a model is loaded, the app is meant to work without internet access.

The model design is what makes this spread possible. Gemma 4 comes in four sizes, including a 26B A4B Mixture-of-Experts model and smaller E2B and E4B variants. The 26B A4B model has 26 billion total parameters, but only about 4 billion are active on each forward pass. That is the trick. It behaves like a much larger model without paying the full compute cost of a dense one every time.

Cloudflare is leaning hard on that efficiency pitch. Its changelog says the model runs almost as fast as a 4B model while aiming for the quality of a much larger system. Google is making a parallel argument from the other end of the stack. The company says Gemma 4 was built to maximize intelligence per parameter, with deployment targets that range from high-end phones to workstations and servers. The family supports up to a 256,000-token context window, multimodal input for text and images across the line, and native audio support on the smaller models. Google also shifted Gemma 4 to an Apache 2.0 license, which lowers the friction for commercial use and makes distribution through third-party platforms easier to understand.

The AI Edge Gallery app shows what that looks like when it lands in a consumer device. The app now advertises Gemma 4 support as its headline feature. It offers chat, image understanding, audio transcription and translation, prompt testing, benchmarking, and a new “Agent Skills” mode that can chain tools together on-device. Google’s own description is blunt about the point of the app: all inference happens on local hardware, no internet required, with privacy coming from the fact that prompts and media stay on the phone.

That does not mean every Gemma 4 experience is now magically local. The largest models still fit better on edge servers and GPUs than on ordinary phones. Even Google’s materials frame the small models as the true mobile-first options, while the 26B A4B and 31B models are aimed at consumer GPUs and workstations.

The split is the story. A developer can keep a lightweight assistant fully on-device, move a heavier multimodal task to a nearby edge node, and still fall back to centralized cloud infrastructure when scale or memory demands spike. For companies that care about latency, that means fewer excuses. For companies that care about data residency, it means new choices. And for users, it means AI is becoming less tied to a single distant place. On Google Play, AI Edge Gallery now describes itself in the simplest possible terms: “100% On-Device Privacy.”
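
For builders, the three-tier split described above comes down to a routing decision made per request. The Python sketch below is one way such a policy might look, not how Google or Cloudflare route traffic. Every name in it (`Tier`, `Request`, `choose_tier`) is hypothetical, and the token limits are placeholder numbers chosen for illustration, not figures from either company.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    ON_DEVICE = "on-device"  # small variant running on the phone
    EDGE = "edge"            # larger model on a nearby edge node
    CLOUD = "cloud"          # centralized fallback


@dataclass(frozen=True)
class Request:
    prompt_tokens: int
    needs_large_model: bool = False    # task judged too heavy for a small variant
    must_stay_on_device: bool = False  # privacy or data-residency constraint
    online: bool = True                # whether the device has a network connection


# Placeholder limits for illustration only; real values depend on hardware and model.
ON_DEVICE_MAX_TOKENS = 4_000
EDGE_MAX_TOKENS = 32_000


def choose_tier(req: Request) -> Tier:
    """Pick the closest tier that can serve the request."""
    local_ok = (not req.needs_large_model
                and req.prompt_tokens <= ON_DEVICE_MAX_TOKENS)

    # Offline or privacy-bound requests have exactly one option.
    if req.must_stay_on_device or not req.online:
        if not local_ok:
            raise ValueError("request cannot be served on-device")
        return Tier.ON_DEVICE

    if local_ok:
        return Tier.ON_DEVICE
    if req.prompt_tokens <= EDGE_MAX_TOKENS:
        return Tier.EDGE
    return Tier.CLOUD
```

Under these assumed limits, a short chat prompt stays on the phone, the same prompt flagged as needing the larger model goes to the edge, and a very long prompt falls through to the cloud. A request that must stay on the device but is too heavy for it fails loudly instead of quietly leaving the phone.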

