Edge inference is scaling fast

Google’s Gemma 4 is being framed as a local-first, on-device model for multimodal inference, shifting some AI workloads off central cloud clusters. (infoq.com) Industry reporting also forecasts that edge AI inference could grow tenfold and cites partnerships, such as the one between Nokia and Blaize, that are pushing hybrid heterogeneous compute toward the edge. (digitimes.com)

Artificial intelligence is moving off giant cloud clusters and onto phones, cameras, factory gear, and telecom networks that can make decisions where data is created. (infoq.com) In this context, “inference” is the step where a trained model answers a prompt or classifies an image, and running that step at the “edge” means doing it on a device or nearby server instead of in a distant data center.

Google said on April 2 that Gemma 4 is built for “local agentic AI on Android” across development and production. (android-developers.googleblog.com) Google’s Android team said Gemma 4 in the AICore Developer Preview comes in two on-device sizes, E2B and E4B, and supports multimodal inputs including text, images, and audio. Google said the new Android version of the model is up to four times faster than earlier versions and uses up to 60% less battery. (developer.android.com)

Google is not pitching Gemma 4 as cloud-only software. In its April 2 launch post, the company said developers can run smaller models directly in apps on Android devices and use larger versions on development machines, while its broader Gemma 4 announcement still pointed users to Google Cloud when local hardware hits a compute limit. (android-developers.googleblog.com) (blog.google)

That split reflects a wider industry pattern: train large models in centralized data centers, then push the answering step closer to the users and machines that need low latency, lower bandwidth use, or offline operation. DigiTimes reported on April 14 that edge inference could grow tenfold as generative artificial intelligence demand shifts away from centralized training workloads. (digitimes.com)

Telecom vendors are trying to turn that shift into infrastructure sales. Blaize and Nokia said in January that they had signed a memorandum of understanding for Asia-Pacific deployments that combine Nokia networking and automation with Blaize’s programmable inference platform for edge, cloud, and data-center environments. (prnewswire.com)

The two companies expanded that work in Singapore at GITEX Asia 2026, according to DigiTimes and the event newsroom, with a joint showcase built around “hybrid heterogeneous computing,” meaning that different kinds of processors split the same workload across edge systems and central infrastructure. GITEX Asia said the showcase ran on April 9 and 10. (digitimes.com) (gitexasia.com)

Google is also framing the edge push around software control, not just hardware. InfoQ reported on April 13 that Gemma 4 is meant to support the Android software lifecycle from coding to production, extending on-device models from chat features into tool-calling agents that can act inside apps. (infoq.com)

The argument for local inference is straightforward: a phone or gateway can answer some requests without sending raw data back to the cloud, which can reduce latency and keep services running when connectivity is weak. Google’s April 2 posts emphasized offline use on Android, while Nokia and Blaize described systems that still connect cleanly to cloud and graphics processing unit (GPU) infrastructure when workloads outgrow the edge. (blog.google) (prnewswire.com)

The result is not the end of the cloud so much as a new division of labor. Google is shrinking capable models to fit Android devices, while network and chip vendors are building systems that decide, prompt by prompt, what stays local and what gets sent upstream. (android-developers.googleblog.com) (digitimes.com)

