Google Tests On-Device LLM Inference API

Published March 8, 2026 by The Daily Scout

Google is experimenting with a MediaPipe LLM Inference API that enables large language models to run entirely on-device across laptops, phones, and desktops. The move could unlock privacy-focused, low-latency AI features for field reporting and mobile video workflows without needing a connection to server-grade GPUs.

Why it matters

The MediaPipe LLM Inference API has since been succeeded by LiteRT-LM, a more robust, production-ready framework that now powers on-device models such as Gemini Nano and Gemma. The open-source C++ framework targets high-performance, cross-platform deployment of LLMs on edge devices, including Android, iOS, and the web, with support for CPU, GPU, and Neural Processing Unit (NPU) acceleration.

For video-centric workflows, on-device processing can significantly reduce server-side infrastructure load and costs. By handling metadata generation, content analysis, and even rough-cut editing directly on a journalist's mobile device, newsrooms can cut the volume of data sent to the cloud for initial processing. The approach offers privacy benefits and low latency, and it makes video ingest pipelines more efficient and cost-effective.

Google's Gemma 2B, a lightweight open model, is designed for these on-device scenarios. Its architecture uses multi-query attention, which reduces memory bandwidth requirements, a common constraint on mobile devices. That makes it feasible to run tasks such as automated video clipping and metadata entry directly in the field, a process that has been shown to achieve up to 80% accuracy in newsroom settings.

AI adoption in newsrooms is growing rapidly, with a strong focus on streamlining video production. News organizations are increasingly turning to AI for transcription, translation, and automated content creation to keep pace with the news cycle. Processing video at the edge, before it reaches a central server, is a significant step forward for mobile-first news gathering.
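
The point about multi-query attention and memory bandwidth can be made concrete with some back-of-the-envelope arithmetic. The sketch below estimates the size of the key/value cache a model must read on every generated token, first with one key/value head per query head and then with a single shared key/value head, which is the multi-query case. The layer count, head count, head size, and context length are illustrative placeholders of roughly 2B-parameter scale, not figures taken from this article, and the helper function is hypothetical.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    bytes_per_value=2 assumes 16-bit floats (an assumption for this sketch).
    """
    # Factor of 2: every layer stores one key tensor and one value tensor.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Illustrative placeholder configuration (assumed, not from the article).
LAYERS = 18
QUERY_HEADS = 8
HEAD_DIM = 256
SEQ_LEN = 2048

# Standard multi-head attention: one key/value head per query head.
mha_bytes = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN)

# Multi-query attention: all query heads share a single key/value head.
mqa_bytes = kv_cache_bytes(LAYERS, 1, HEAD_DIM, SEQ_LEN)

MIB = 2 ** 20
print(f"multi-head cache:  {mha_bytes / MIB:.0f} MiB")
print(f"multi-query cache: {mqa_bytes / MIB:.0f} MiB")
print(f"reduction factor:  {mha_bytes // mqa_bytes}x")
```

Under these assumed numbers the cache shrinks from 288 MiB to 36 MiB, a reduction equal to the number of query heads. Less cached data to stream from memory per token is what makes the technique attractive on bandwidth-constrained phones.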

Key numbers

Gemma 2B: Google's lightweight open model, designed for on-device scenarios.
Up to 80%: the accuracy rate that automated video clipping and metadata entry in the field has been shown to achieve in newsroom settings.

What happens next

The move could unlock privacy-focused, low-latency AI features for field reporting and mobile video workflows without needing a connection to server-grade GPUs.
