Study Benchmarks LLMs on Edge Hardware

A new academic evaluation tested the performance of large language models (LLMs) on resource-constrained single-board computers, finding that model quantization makes inference viable on devices with as little as 2-8GB of RAM. The study highlights that sustained performance is often limited by thermal throttling and power draw rather than raw compute. ARM Cortex-A76 cores, common in high-end SBCs, achieved 1-2 tokens per second with quantized models.

- Post-Training Quantization (PTQ) is a common technique for resource-constrained environments. It converts a pre-trained model's weights and activations from 32-bit floating-point numbers to lower-precision formats such as 8-bit or 4-bit integers. This can reduce a 7-billion-parameter model's memory footprint from 28 GB to as little as 3.5 GB.
- Advanced quantization methods like Activation-aware Weight Quantization (AWQ) and GPTQ are designed to compress models to 4-bit precision with minimal accuracy loss by identifying and protecting the most important weights. For example, 4-bit quantized Llama 3.1 models have been shown to recover 98.9% of the accuracy of their full-precision counterparts on coding benchmarks.
- Energy consumption is a critical factor for LLMs on edge devices. Training a 7B-parameter model might consume around 50 MWh, but inference energy is also significant. On a single-board computer, smaller models like Phi-3 (3.8B parameters) are more energy-efficient than larger 8B models, consuming as little as 0.93 joules per token.
- Thermal management strategies are crucial for preventing performance degradation. On high-density AI boards, key design considerations for improving heat dissipation include positioning high-power components away from each other and using PCB substrates with higher thermal conductivity, such as ceramic-based materials.
- While the study mentions ARM Cortex-A76, newer architectures and specialized hardware can significantly boost performance. For instance, collaborative efforts between Arm and Meta have enabled quantized Llama 3.2 models to run up to 20% faster on Arm Cortex-A v9 CPUs with specific instruction set extensions.
- Inference speed can vary greatly with hardware. On a Ryzen 5800X CPU (8 cores), a quantized Mixtral 8x7B model (a mixture-of-experts model considerably larger than a dense 7B model) achieves about 1.4 tokens per second. With partial offloading to a 16 GB GPU, the speed can jump to 9.5 tokens per second.
- Beyond single-board computers, split-computing, or collaborative edge-cloud inference, is an emerging strategy for running larger models on resource-limited devices. This approach partitions the model and offloads computationally heavy components to the cloud, which can reduce on-device energy consumption by over 77% for a 7B model compared to an edge-only baseline.
- The selection of calibration data during Post-Training Quantization significantly impacts the model's ability to generalize to diverse, real-world tasks. Counter-intuitively, using a calibration dataset with the same distribution as the test data does not always yield optimal performance.
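
The memory figures quoted for a 7-billion-parameter model follow from simple arithmetic on the bit width of each weight. The sketch below is illustrative only: the function name `weight_footprint_gb` is hypothetical, it assumes decimal gigabytes (1 GB = 10^9 bytes), and it counts weight storage alone, ignoring activations, the KV cache, and quantization metadata such as scales.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same bit width, with no
    extra metadata (scales, zero points) and no runtime buffers.
    """
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


PARAMS_7B = 7e9  # 7 billion parameters, as in the example above

for bits in (32, 16, 8, 4):
    size = weight_footprint_gb(PARAMS_7B, bits)
    print(f"{bits:>2}-bit weights: {size:.1f} GB")
```

At 32 bits this gives 28 GB and at 4 bits 3.5 GB, matching the range cited above. Real quantized model files are usually somewhat larger, since formats typically store per-group scaling data alongside the weights.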

