On-device LLMs hit production paths
Open-source and vendor projects are pushing LLM inference onto devices and small edge hosts, making on-device assistants more practical for enterprise endpoints. Google's LiteRT-LM aims to run Gemma and Llama models with GPU/NPU acceleration on Android, iOS, and Raspberry Pi. Locally AI, for its part, published Gemma-optimized models for Macs and similar edge hosts. For edge-heavy operations, that means more options to run private, low-latency models on handhelds or local gateways instead of always routing to the cloud. (x.com) (x.com)
For years, most artificial intelligence assistants worked like a walkie-talkie: your phone recorded a request, sent it to a data center, and waited for an answer to come back. Google is now opening up the engine it already uses for on-device language models in Chrome, Chromebook Plus, and Pixel Watch, so more developers can keep that loop on the device itself. (developers.googleblog.com)

That engine is called LiteRT-LM, and Google describes it as a production-ready, open-source framework for running large language models directly on edge devices. The supported targets now include Android, iOS, web, desktop, and internet-of-things hardware such as Raspberry Pi. (ai.google.dev)

A large language model is the part that predicts the next word over and over until it forms an answer. Inference is the moment that prediction happens live for a user, and moving inference onto a phone or local box cuts out the round trip to a remote server. (ai.google.dev)

The hard part is speed, because a phone chip has far less power and memory than a cloud graphics processor. LiteRT-LM is built to use graphics processors and neural processing units, which are the special-purpose blocks inside modern mobile chips that handle matrix math more efficiently than a central processor alone. (ai.google.dev 1) (ai.google.dev 2)

Google is not pitching this as a lab demo. Its own blog says the same LiteRT-LM stack has already powered Gemini Nano deployments in Chrome, Chromebook Plus, and Pixel Watch, and the company is now exposing lower-level interfaces so outside developers can build custom pipelines on top of that same runtime. (developers.googleblog.com 1) (developers.googleblog.com 2)

The model list is broader than Google’s own family. Google’s documentation says LiteRT-LM can run Gemma, Llama, Phi-4, and Qwen models, which means a company can pick smaller open models instead of waiting for one vendor’s cloud application programming interface. (ai.google.dev)

Google also added features that used to be associated with cloud agents. The current LiteRT-LM docs mention vision and audio support, and Google’s command-line tool now supports tool calling, which lets a local model trigger a defined function instead of only generating plain text. (ai.google.dev) (developers.googleblog.com)

The other half of the story is that software around these models is getting tuned for specific hardware instead of treating every laptop the same. Locally AI says its app runs Gemma, Llama, Qwen, and DeepSeek models offline on iPhone, iPad, and Mac, with builds optimized for Apple Silicon. (locallyai.app)

That changes the tradeoff for companies with field devices, retail terminals, and factory gateways. A handheld scanner or a Raspberry Pi-class box can now run a private assistant nearby for summarizing notes, extracting structured fields, or calling a local tool, without sending every prompt through the public internet first. (ai.google.dev) (developers.googleblog.com)

Cloud models are still larger, still smarter on many tasks, and still easier when you need one system to serve millions of users at once. But the production path is now clearer than it was a year ago: the model gets smaller, the runtime gets faster, and the device in your hand starts doing work that used to require a server rack. (ai.google.dev) (developers.googleblog.com)
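
For readers who want to see what tool calling looks like in practice, here is a minimal Python sketch of the general pattern: the model either returns plain text or a small JSON message naming a function to run, and the host application decides what to do with it. This is not LiteRT-LM's API. The JSON shape, the `TOOLS` registry, the `lookup_stock` function, and its inventory numbers are all hypothetical and exist only to illustrate the idea on a scanner-style device.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical registry: the only functions a local model may trigger.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function under its own name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def lookup_stock(sku: str) -> dict:
    # Stand-in for an on-device inventory lookup; the data is made up.
    inventory = {"A-100": 12, "B-200": 0}
    return {"sku": sku, "on_hand": inventory.get(sku, 0)}


def handle_model_output(raw: str) -> dict:
    """Run a tool if the output is a valid call, else pass the text through.

    Assumed (hypothetical) call format:
    {"tool": "<name>", "arguments": {...}}
    """
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return {"type": "text", "content": raw}

    name = msg.get("tool") if isinstance(msg, dict) else None
    if not isinstance(name, str):
        # Valid JSON, but not a tool call: treat it as ordinary text.
        return {"type": "text", "content": raw}

    fn = TOOLS.get(name)
    if fn is None:
        # Never run anything the host did not register.
        return {"type": "error", "content": f"unknown tool: {name}"}

    args = msg.get("arguments", {})
    if not isinstance(args, dict):
        return {"type": "error", "content": "arguments must be an object"}

    try:
        result = fn(**args)
    except TypeError as exc:
        # Wrong or missing argument names from the model.
        return {"type": "error", "content": f"bad arguments: {exc}"}
    return {"type": "tool_result", "tool": name, "content": result}


if __name__ == "__main__":
    print(handle_model_output('{"tool": "lookup_stock", "arguments": {"sku": "A-100"}}'))
    print(handle_model_output("The shelf looks fully stocked."))
```

The design choice worth noting is the allow-list: the model can only name functions the application registered in advance, and anything else comes back as an error rather than being executed. That matters on a field device, where the tool may touch local hardware or data.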