Local LLMs and Gemma benchmarks

Creators and developers are increasingly running Gemma 4 and other LLMs locally on Apple Silicon, and early experiments with on-device vision agents and iPhone Neural Engine deployments suggest that practical, low-latency workflows are within reach. Community posts show Gemma 4 doing agentic vision with MLX plus SAM 3.1 on MacBooks, while CoreML-LLM v0.2 reports Gemma 4 E2B running on the iPhone Neural Engine with about 188 ms time to first token (TTFT), about 11 tokens per second, and 99.78% of operations mapped to the Apple Neural Engine (ANE). Those signals point to rising expectations around memory efficiency, thermal stability, and developer tooling for on-device AI. (x.com) (x.com) (youtube.com)

Running a language model on your own laptop used to mean fans spinning, battery draining, and a long pause before the first word appeared. In April 2026, Google pitched Gemma 4 as an open model family built to run on local hardware, with small “Effective 2B” and “Effective 4B” versions aimed at phones, laptops, and other edge devices. (blog.google)

A language model is the part that predicts the next word, like phone autocomplete scaled up from one word to whole paragraphs. A local model keeps that prediction work on your device instead of sending every prompt to a remote data center. (ai.google.dev)

Apple Silicon changed that local picture because its unified memory sits close to the processor, like keeping ingredients on the kitchen counter instead of in the garage. MLX is Apple’s own array framework for machine learning on Apple Silicon, and its design is meant for efficient model work on Macs. (github.com) That is why developers keep pairing Gemma 4 with MLX on MacBooks. Google’s own edge announcement on April 2, 2026 said Gemma 4 was built for multi-step planning, visual processing, and support for more than 140 languages on-device, not just plain chat. (developers.googleblog.com)

Vision is the next piece of the puzzle. A vision model turns pixels into labeled regions and objects, like drawing neat outlines around every item on a cluttered desk before the language model decides what to do next. (github.com) Meta’s Segment Anything Model 3.1, released March 27, 2026, is one of the tools people are using for that outlining step. Meta says SAM 3.1 adds “Object Multiplex,” a shared-memory method for tracking many objects at once, with about a 7 times speedup at 128 objects on a single H100 graphics processor compared with the November 2025 SAM 3 release. (github.com)

Put those parts together and you get the demos circulating this month: Gemma 4 handling the reasoning, MLX handling the Mac-side execution, and SAM 3.1 handling the image regions. That stack is what makes an on-device “vision agent” possible on a MacBook, where the model can inspect part of an image, decide on the next step, and keep going without a server round-trip. (developers.googleblog.com) (github.com 1) (github.com 2)

The iPhone side is a different test because phones have tighter power and heat limits than laptops. That is where Apple’s Neural Engine matters: it is a separate chip block built for machine learning jobs, like having a dedicated checkout lane instead of sending every shopper through the same line. (github.com) One of the clearest public numbers came from the CoreML-LLM project, which says that in version 0.2.0 Gemma 4 E2B can run on an iPhone 17 Pro with about 220 milliseconds of prefill for a 40-token prompt, a decode speed of about 11 tokens per second, and 99.78 percent of measured language-model operations dispatched to the Apple Neural Engine. (github.com)

CoreML-LLM also reports why those numbers are more believable than the old “it fits in a few hundred megabytes” claims people often post. Its README says the real physical memory footprint for Gemma 4 E2B on iPhone is about 873 megabytes after load and about 981 megabytes during inference, while earlier Xcode gauge readings underreported usage. (github.com)

Google’s own launch helps explain why the smallest Gemma 4 models are getting so much attention. The company says the E2B and E4B versions prioritize multimodal use, low latency, and local execution, while the whole Gemma family is released under Apache 2.0 and builds on more than 400 million downloads across generations. (blog.google)

So the story is no longer “can a phone or Mac run a model at all.” The story is that developers now have concrete targets: a first response in roughly 200 milliseconds on a phone, roughly 1 gigabyte of real memory use, and near-total Neural Engine placement on iPhone. Meanwhile, Mac users are already wiring local vision pipelines together with MLX and SAM 3.1. (github.com 1) (github.com 2) (github.com 3)
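
To make the CoreML-LLM figures quoted above more concrete, here is a rough back-of-the-envelope sketch in Python. It converts the reported prefill time into a prompt-processing rate and estimates how long a full reply might take. The 100-token reply length is a hypothetical value chosen for illustration, the function names are invented for this sketch, and the estimate assumes the decode rate stays constant for the whole reply, which real devices under thermal load may not manage.

```python
# Rough latency budget built from the CoreML-LLM figures quoted in the article.
# Assumption: decode speed is constant and prefill cost is paid once per prompt.

PREFILL_MS = 220.0        # reported prefill time for the prompt
PROMPT_TOKENS = 40        # reported prompt length for that prefill figure
DECODE_TOK_PER_S = 11.0   # reported decode speed
REPLY_TOKENS = 100        # hypothetical reply length, not from the source


def prefill_rate(prompt_tokens: int, prefill_ms: float) -> float:
    """Prompt tokens processed per second during prefill."""
    return prompt_tokens / (prefill_ms / 1000.0)


def reply_seconds(prefill_ms: float, reply_tokens: int, decode_tok_per_s: float) -> float:
    """Prefill time plus the time to decode reply_tokens at a constant rate."""
    return prefill_ms / 1000.0 + reply_tokens / decode_tok_per_s


if __name__ == "__main__":
    rate = prefill_rate(PROMPT_TOKENS, PREFILL_MS)
    total = reply_seconds(PREFILL_MS, REPLY_TOKENS, DECODE_TOK_PER_S)
    print(f"Prefill rate: {rate:.0f} tokens/s")
    print(f"Estimated time for a {REPLY_TOKENS}-token reply: {total:.1f} s")
```

Under these assumptions the first word arrives quickly, but decode speed dominates the wait for longer answers, which is why the tokens-per-second figure matters as much as time to first token.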

Shared from Scout - Be the smartest in the room.