Google Tests On-Device LLM Inference API

Published by The Daily Scout

What happened

Google is experimenting with a MediaPipe LLM Inference API that enables large language models to run entirely on-device across laptops, phones, and desktops. The move could unlock privacy-focused, low-latency AI features for field reporting and mobile video workflows without needing a connection to server-grade GPUs.

Why it matters

The MediaPipe LLM Inference API has since been succeeded by LiteRT-LM, a more robust, production-ready framework that now powers on-device models such as Gemini Nano and Gemma. The open-source C++ framework targets high-performance, cross-platform LLM deployment on edge devices, including Android, iOS, and the web, with support for CPU, GPU, and Neural Processing Unit (NPU) acceleration.

For video-centric workflows, moving processing on-device can significantly reduce server-side infrastructure load and cost. When metadata generation, content analysis, and even rough-cut editing run directly on a journalist's mobile device, newsrooms send less data to the cloud for initial processing. Beyond the privacy and latency benefits, that makes video ingest pipelines more efficient and cheaper to run.

Google's Gemma 2B, a lightweight open model, is designed specifically for such on-device scenarios. Its architecture uses multi-query attention, in which all query heads share a single set of key and value projections; this shrinks the key-value cache and reduces memory bandwidth requirements, a common constraint on mobile devices. That makes it feasible to perform tasks like automated video clipping and metadata entry directly in the field, a process reported to reach up to 80% accuracy in newsroom settings.

AI adoption in newsrooms is growing rapidly, with a strong focus on streamlining video production. News organizations increasingly turn to AI for transcription, translation, and automated content creation to keep pace with the news cycle. Processing video at the edge, before it reaches a central server, is a significant step forward for mobile-first news gathering.
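To make the memory point about multi-query attention concrete, here is a rough back-of-the-envelope sketch in Python. The layer count, head count, head dimension, context length, and 16-bit cache precision are illustrative assumptions chosen to resemble a small on-device model; they are not figures from this article, and the helper `kv_cache_bytes` is a hypothetical name. The sketch only compares the size of the key-value cache when every query head has its own keys and values against the case where all heads share one set.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence (hypothetical helper)."""
    # Factor of 2: each layer stores one tensor of keys and one of values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed, illustrative configuration (not taken from the article).
NUM_LAYERS = 18
NUM_QUERY_HEADS = 8
HEAD_DIM = 256
SEQ_LEN = 2048  # tokens held in context

# Multi-head attention: every query head has its own key/value head.
mha = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN)

# Multi-query attention: all query heads share a single key/value head.
mqa = kv_cache_bytes(NUM_LAYERS, 1, HEAD_DIM, SEQ_LEN)

print(f"multi-head KV cache:  {mha / 2**20:.0f} MiB")
print(f"multi-query KV cache: {mqa / 2**20:.0f} MiB")
print(f"reduction factor:     {mha // mqa}x")
```

Under these assumptions the cache shrinks by a factor equal to the number of query heads, which is the portion of memory traffic that grows with context length during generation. Model weights are unaffected, so the overall saving on a real device would be smaller than this ratio suggests.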

Key numbers

  • Gemma 2B: Google's lightweight open model, at roughly 2 billion parameters, is designed specifically for on-device scenarios.
  • Up to 80%: the reported accuracy rate for automated video clipping and metadata entry performed in the field in newsroom settings.

What happens next

  • On-device inference could unlock privacy-focused, low-latency AI features for field reporting and mobile video workflows, with no connection to server-grade GPUs required.

