New Benchmark for On-Device AI Models Released

Published February 19, 2026 by The Daily Scout

A new benchmarking suite called MobileAIBench has been introduced to evaluate large language and multimodal models for on-device applications. The benchmark assesses accuracy, latency, memory footprint, and energy consumption, which are critical metrics for resource-constrained aerospace systems. Early results confirm that techniques like quantization and pruning are essential for running advanced AI models on embedded hardware.

Why it matters

- MobileAIBench was developed by researchers from Salesforce AI Research to provide a standardized way to evaluate open-source LLMs and Large Multimodal Models (LMMs) specifically for mobile performance.
- The framework consists of two main parts: a desktop library for running evaluations on tasks like NLP and trust & safety, and a mobile app for both iOS and Android to measure on-device hardware utilization.
- Specific open-source models tested in the benchmark include TinyLlama-1.1B, Phi-2, Gemma-2B, StableLM-Zephyr-3B, and the multimodal model Llava-Phi-2, all in the `.gguf` format.
- Beyond latency and memory, the on-device app is designed to capture detailed hardware metrics, including CPU, RAM, GPU, and thermal state, using platform-specific tools like Apple's Instruments.
- The benchmark goes beyond traditional NLP accuracy metrics to explicitly include evaluations for trust and safety, a critical consideration for models deployed in sensitive applications.
- While other benchmarks have focused on the accuracy impact of quantization, MobileAIBench is designed to reveal the practical deployment challenges by measuring mobile-specific metrics like battery drain on real-world devices.
- The approach of combining pruning and quantization, which MobileAIBench evaluates, has been shown in other applications to reduce model size by up to 75% and power consumption by 50% while maintaining over 95% accuracy.
- The underlying evaluation framework for the mobile app relies on the `llama.cpp` inference engine, a popular choice for running LLMs efficiently on consumer hardware using C/C++.
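
To make the latency and memory metrics above concrete, here is a minimal Python sketch of a measurement loop. It is not MobileAIBench's code and does not call `llama.cpp`: `measure` and `fake_generate` are hypothetical names, and `fake_generate` is only a stand-in for a model call. The sketch uses the standard-library `time.perf_counter` for wall-clock latency and `tracemalloc` for peak Python heap usage.

```python
import time
import tracemalloc
from statistics import median


def measure(fn, *args, repeats=5):
    """Return median wall-clock latency (s) and peak traced Python memory (bytes)."""
    latencies = []
    tracemalloc.start()
    try:
        for _ in range(repeats):
            start = time.perf_counter()
            fn(*args)
            latencies.append(time.perf_counter() - start)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return {"median_latency_s": median(latencies), "peak_bytes": peak}


def fake_generate(prompt):
    # Hypothetical stand-in for a model call; it only builds a list.
    return [len(word) for word in prompt.split() * 1000]


result = measure(fake_generate, "how long does one reply take")
print(result)
```

Two caveats apply. `tracemalloc` sees only allocations made through Python's allocator, so it would miss memory held by a native inference engine, and tracing itself slows the code being timed. On-device measurement of CPU, GPU, thermal state, and battery drain requires platform tools of the kind the article mentions.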

Key numbers

- Models tested include TinyLlama-1.1B, Phi-2, Gemma-2B, StableLM-Zephyr-3B, and the multimodal model Llava-Phi-2, all in the `.gguf` format.
- Combining pruning and quantization has been shown in other applications to reduce model size by up to 75% and power consumption by 50% while maintaining over 95% accuracy.
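
As a rough illustration of where a figure like 75% can come from, the sketch below estimates weight storage as parameters times bits per weight. This is back-of-the-envelope arithmetic under stated assumptions, not the method behind the numbers quoted above: the parameter count is the nominal 1.1 billion taken from the name TinyLlama-1.1B, the function `model_size_gb` is hypothetical, and file metadata, activations, and runtime buffers are ignored. Moving from 32-bit to 8-bit weights alone gives 1 - 8/32 = 75%.

```python
def model_size_gb(params, bits_per_weight, kept_fraction=1.0):
    """Approximate weight storage in GB (10^9 bytes).

    kept_fraction models pruning as the share of weights retained;
    it assumes pruned weights are not stored at all.
    """
    return params * kept_fraction * bits_per_weight / 8 / 1e9


params = 1.1e9  # assumption: nominal count from the model name

fp32 = model_size_gb(params, 32)
int8 = model_size_gb(params, 8)
reduction = 1 - int8 / fp32

print(f"FP32: {fp32:.2f} GB, 8-bit: {int8:.2f} GB, reduction: {reduction:.0%}")
```

Passing a `kept_fraction` below 1.0 shows how pruning would compound with quantization in this simplified model. Real savings depend on the storage format and on whether the runtime can exploit sparsity.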
