Open, efficient AI pushes toward the edge
Meta and Google released more efficient open models to run on diverse hardware, making local inference for support and control‑plane tasks more practical. Meta rolled out Llama 4 variants including Llama 4 Scout and Maverick as open‑weight multimodal MoE models, while Google’s Gemma 4 is positioned to run across devices and claims token‑efficiency gains that reduce inference cost and bandwidth pressure (ainvest.com) (evermx.com) (verdict.co.uk) (geeky-gadgets.com).
The race in AI has been framed as a contest to build the biggest model and the biggest data center. This week, Meta and Google made a different bet. They released open models designed to do more work with less hardware, which matters because a large share of useful AI jobs do not need a giant remote model at all. They need something that can run close to the user, inside an app, on a workstation, or on the machines that keep a service running. Google introduced Gemma 4 on April 2, 2026, under an Apache 2.0 license, and positioned it explicitly for phones, laptops, desktops, and edge devices. Meta’s Llama 4 Scout and Maverick, first released on April 5, 2025, were built around the same practical idea from the other direction: open weights, multimodal input, and sparse architectures that avoid lighting up every parameter for every token (blog.google) (ai.google.dev) (about.fb.com).
That last detail is the hinge. Both families lean on mixture-of-experts designs, which keep total model size high while activating only a smaller slice of the network at a time. Google’s new line includes a 26B MoE model alongside smaller E2B and E4B models and a 31B dense model. Meta split Llama 4 into Scout and Maverick, both with 17 billion active parameters, but very different expert layouts: Scout uses 16 experts and Maverick uses 128. The point is not just elegance. Sparse activation cuts the compute needed for inference, which is the expensive part when companies deploy models at scale or try to run them locally (ai.google.dev) (huggingface.co) (pytorch.org).
Once you see that, the product choices make more sense. Google says Gemma 4 was sized to run “from billions of Android devices worldwide, to laptop GPUs, all the way up to developer workstations and accelerators.” Its model card is even more blunt: the family targets deployments from high-end phones to servers, with the smallest models optimized for local execution and the medium models stretching to 256,000 tokens of context. Meta made a similar claim for Scout, saying it can outperform prior Llama models while fitting on a single Nvidia H100, and paired that with a headline-grabbing 10 million token context window. These are not consumer-chat bragging rights. They are infrastructure claims. A model that fits on one accelerator, or on-device, is easier to place inside support tools, moderation systems, copilots, routers, and control-plane software that cannot afford a slow round trip to a distant API every time something changes (blog.google) (ai.google.dev) (about.fb.com).
The open licensing matters almost as much as the hardware profile. Google put Gemma 4 under Apache 2.0, which removes a lot of the hesitation enterprises had around narrower “open” terms. Meta’s Llama 4 remains open-weight rather than open-source in the strict software sense, with a custom community license and gated access through its repositories. That difference is not academic. If a company wants to adapt a model for internal tooling, run it in regulated environments, or ship it inside a product without betting the business on someone else’s API pricing, licensing terms decide what is actually possible (blog.google) (ai.google.dev) (huggingface.co).
There is also a more basic reason this shift matters. The useful edge workloads are usually narrow, repetitive, and latency-sensitive. They classify tickets. They summarize logs. They inspect screenshots. They call tools. They keep a service alive. For those jobs, the frontier is not a model that can write a novel. It is a model that can see an image, reason through a few steps, and respond cheaply enough to be called all day. Google is pitching Gemma 4 exactly that way, with “agentic workflows” and edge deployment tools through AI Edge and LiteRT-LM. Meta’s own engineering notes on Llama 4 focus not on lofty intelligence claims but on the ugly mechanics of making sparse models run fast enough in production despite memory and communication pressure.
That is the real story here. The industry is still building giant centralized models, but some of the most important new releases are trying to escape the data center, one active parameter at a time (developers.googleblog.com) (pytorch.org) (blog.google).
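For readers who want to see what "activating only a smaller slice of the network" means in practice, here is a minimal, illustrative top-k mixture-of-experts layer in plain Python. It is a sketch under stated assumptions, not Meta's or Google's implementation: the names `route_token`, `moe_layer`, and `make_expert` are hypothetical, the experts are toy scaling functions, the router scores are passed in rather than learned, and `top_k=1` is an assumption, since the number of experts each model activates per token is not stated here.

```python
import math
import random


def softmax(scores):
    """Turn raw router scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores, top_k):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Weights are renormalized so the selected experts' weights sum to 1.
    """
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(token, experts, router_scores, top_k=1):
    """Run only the selected experts and mix their outputs by router weight.

    Experts that are not selected are never called, which is where the
    inference savings of a sparse model come from.
    """
    routes = route_token(router_scores, top_k)
    output = [0.0] * len(token)
    for idx, weight in routes:
        expert_out = experts[idx](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, [idx for idx, _ in routes]


def make_expert(scale):
    """Hypothetical toy expert: multiplies every feature by a constant."""
    return lambda token: [scale * x for x in token]


if __name__ == "__main__":
    rng = random.Random(0)
    num_experts = 16  # the expert count the article gives for Scout
    experts = [make_expert(i + 1) for i in range(num_experts)]
    # Assumption: random scores stand in for a learned router's output.
    scores = [rng.gauss(0, 1) for _ in range(num_experts)]
    _, used = moe_layer([1.0, 2.0], experts, scores, top_k=1)
    print(f"ran {len(used)} of {num_experts} experts for this token")
```

With 16 experts and `top_k=1`, only one expert function is called for the token and the other 15 cost nothing for it. A production router is a trained layer and real experts are full feed-forward blocks, but the shape of the saving is the same one the article attributes to sparse designs.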