Gemma 4 and local runs

Google’s Gemma 4 family was mentioned on social as a new open model lineup, and a developer guide published this weekend explains how to run Gemma 4 locally with tools like Ollama, llama.cpp and vLLM while noting VRAM and trade‑offs. The pairing of the announcement and the how‑to guide spotlights both the model release and local deployment options. (x.com) (dev.to)

A large language model is a prediction engine that guesses the next token in a sequence, and Google’s new Gemma 4 lineup is built so some versions can run on a phone, a laptop, or a workstation instead of only in a cloud data center. Google announced Gemma 4 on April 2, 2026 as an Apache 2.0 open-weights family with four sizes: Effective 2B, Effective 4B, 26B A4B, and 31B. The company said the models were built from the same research stack as Gemini 3. The family handles text and image input across all sizes, adds audio support on the smaller models, supports more than 140 languages, and stretches context windows to 128,000 tokens on the small models and 256,000 on the larger ones. Google also ships both pre-trained and instruction-tuned versions. Running a model “locally” means the weights sit on your own machine and inference happens there, which can cut cloud bills and keep prompts on-device. Google pitched Gemma 4 for hardware ranging from Android devices and laptop graphics cards to developer workstations and accelerators. The split inside the lineup is practical. E2B and E4B are the edge models for tighter hardware limits, while 26B A4B and 31B are the workstation-class options for users who want more capability and have more memory to spare. That hardware trade-off shows up fastest in local tools. Ollama already lists Gemma 4 tags with download sizes of 7.2 gigabytes for E2B, 9.6 gigabytes for E4B, 18 gigabytes for 26B A4B, and 20 gigabytes for 31B in one-click style packages. For developers who want an OpenAI-style server on their own machine, vLLM added a Gemma 4 guide with support for NVIDIA and Advanced Micro Devices graphics processors as well as Google Cloud tensor processing units. Its recipe lists a minimum of one 24-gigabyte NVIDIA graphics card for E2B or E4B in bfloat16, and one 80-gigabyte card for 31B or 26B A4B. For lower-level local inference, llama.cpp has already merged Gemma 4-specific parser and tokenizer changes into its repository. That matters because llama.cpp is the stripped-down C and C++ runtime many developers use to squeeze models onto consumer hardware through quantized files and custom builds. Google is also framing Gemma 4 as more than a chat model. The official model card highlights reasoning, coding, native system prompts, and function calling, while the edge blog post ties the release to on-device “agentic” workflows that can plan across multiple steps. The immediate question for developers is not whether Gemma 4 exists, but which size fits their machine and toolchain. The release landed with official docs, Ollama packaging, vLLM serving guidance, and fresh llama.cpp support in the same week, which makes local testing possible before a team commits to a cloud deployment.

Get your own daily briefing

Scout delivers personalized news, insights, and conversations tailored to your role and industry.

Download on the App Store

Shared from Scout - Be the smartest in the room.