Google Gemma 4 on Apple silicon

- Google released Gemma 4 12B Unified on June 3, adding a multimodal open model designed to run locally on laptops, including Apple silicon systems. - The key spec is 256K context, alongside native audio input, configurable thinking modes, function calling, and MLX support for Apple silicon. - Google lists Gemma 4 12B on its releases page and MLX integration docs, with downloads available through Kaggle and Hugging Face.

Google on June 3 released Gemma 4 12B, a new open-weight model positioned between its smaller edge models and larger server-oriented Gemma variants. The company said the model is designed to bring multimodal inference to laptops, with native handling for text, image and audio inputs and a memory footprint small enough for local runs on machines with 16GB of VRAM or unified memory. Google also published MLX documentation the same day showing how Gemma runs on Apple silicon through Apple’s machine learning framework. The release matters because Google is now explicitly treating Apple silicon laptops as a practical target for local inference, not just as developer test hardware. In its product post, Google said Gemma 4 12B was built as a “unified, encoder-free multimodal model” for laptops, while its developer documentation lists MLX as a supported path for running Gemma locally on Apple silicon. (ai.google.dev) ### What exactly did Google ship on June 3? Google’s Gemma release log lists “Gemma 4 12B Unified” as a June 3, 2026 launch. The model expands a Gemma 4 family that previously included E2B, E4B, 31B and 26B A4B variants released on March 31, with Multi-Token Prediction updates added on April 16. Google DeepMind product managers Olivier Lacombe and Gus Martins wrote that the 12B model is meant to “bridge the gap” between the edge-focused E4B and the larger 26B mixture-of-experts model. (blog.google) They said the 12B version packages “agentic multimodal intelligence directly to laptops” and is released under an Apache 2.0 license. ### Why is Apple silicon central to this release? (ai.google.dev) Google’s MLX integration page says MLX is “an array framework for machine learning on Apple silicon,” and it provides Gemma quick-start commands for local text generation, vision tasks and local server setup. The page shows developers can launch an OpenAI-compatible local endpoint with `mlx_vlm.server`, a setup aimed at on-device or local-network use rather than a hosted cloud service. (blog.google) Google’s launch post says Gemma 4 12B is “small enough to run locally with just 16GB of VRAM or unified memory.” On Apple hardware, unified memory is the shared memory pool used across CPU and GPU, which makes that line a direct fit for MacBooks and other Apple silicon devices. That is Google’s stated hardware target, rather than an inference drawn from third-party benchmarks. (ai.google.dev) ### What can the 12B model do that earlier Gemma 4 models could not? Google said Gemma 4 12B is its “first mid-sized model to feature native audio inputs.” The company also said the model uses an encoder-free architecture in which vision and audio inputs flow directly into the language model backbone, replacing the separate multimodal encoders commonly used in other systems. (blog.google) Google’s Gemma 4 overview page says the family supports configurable thinking modes, built-in function calling, native system prompts and up to 256K context on medium models. The same page says audio is featured natively on the E2B, E4B and 12B models, while the 12B variant is specifically described as a unified encoder-free multimodal model. ### Why are developers likely to focus on local serving and agent workflows? (blog.google) Google’s developer docs say Gemma 4 adds built-in support for function calling and structured system prompts, while the MLX page shows a local server flow with an OpenAI-compatible endpoint. That combination makes the model usable not only for chat and coding tasks, but also for local agents that call tools or process multimodal inputs on-device. (ai.google.dev) Google also says all Gemma 4 models include Multi-Token Prediction draft models for speculative decoding, which it describes elsewhere as a way to reduce latency and speed inference. That matters for local deployments, where responsiveness is often constrained by memory bandwidth and available compute rather than raw model quality alone. ### Where does this leave the next step? Google’s Gemma documentation says the 12B model can be downloaded through Kaggle and Hugging Face, while the MLX page points developers to Apple-silicon-specific local setup commands and server tooling. (ai.google.dev) The immediate next step is developer testing on MacBooks and other Apple silicon devices using MLX, with Google’s release log now listing the 12B model as part of the production Gemma 4 lineup.

Get your own daily briefing

Scout delivers personalized news, insights, and conversations tailored to your role and industry.

Download on the App Store

Shared from Scout - Be the smartest in the room.