New Library Ports LLMs to Apple's Neural Engine
A new open-source library from AnemllClaw enables porting large language models to run on Apple's Neural Engine. This allows for seamless, on-device AI inference on iOS and macOS, a key development for embedding privacy-focused intelligence into handheld enterprise devices.
The ANEMLL open-source project aims to create a complete pipeline for running large language models on Apple's Neural Engine (ANE), converting them directly from formats like Hugging Face to CoreML. This initiative is crucial for developing autonomous and privacy-centric applications on edge devices, as it allows for on-device inference without an internet connection. A key focus of ANEMLL is optimizing for the ANE's constraints, such as its FP16 precision, which cannot represent values above 65,504, a problem for some models like Gemma 3. The library includes tools for model splitting and optimization to fit within the memory limits of iOS and macOS, and it supports models such as various LLaMA and DeepSeek variants.
Performance benchmarks highlight a significant trade-off between the ANE and the GPU. In one test with an 8-billion-parameter model on an M4 Max, ANEMLL-powered ANE inference ran at 9.3 tokens/sec using about 500 MB of memory. In contrast, the same model running on the GPU with MLX achieved 31.33 tokens/sec but consumed a much larger 8.5 GB of memory. This trade-off underscores the primary advantage of targeting the ANE: power efficiency. The ANE is approximately four times more power-efficient than the GPU, a critical factor for battery life on handheld devices like iPhones. Offloading inference to the ANE also frees the more power-hungry GPU for other tasks.
The main performance bottleneck for running transformers is memory bandwidth, where the ANE's ceiling is lower than the GPU's. The memory bandwidth of Apple's M-series chips varies significantly, with the M4 Max reaching up to 546 GB/s, and this directly limits the achievable tokens-per-second generation speed. To help developers and researchers optimize for these hardware specifics, the ANEMLL project includes a benchmarking tool called `anemll-bench`.
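The link between memory bandwidth and generation speed can be illustrated with a back-of-the-envelope estimate. The sketch below is not part of ANEMLL; it assumes a common rule of thumb that every model weight is read from memory once per generated token and that nothing else (KV cache, activations, compute) limits throughput. The function name `max_tokens_per_sec` and its parameters are hypothetical.

```python
def max_tokens_per_sec(bandwidth_gb_s: float,
                       params_billion: float,
                       bytes_per_param: float) -> float:
    """Rough ceiling on decode speed under the assumption that all
    weights are streamed from memory once per generated token."""
    # Billions of parameters times bytes per parameter gives gigabytes.
    model_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / model_gb


# 546 GB/s is the M4 Max peak bandwidth cited in the article.
# Storing 8 billion parameters as FP16 (2 bytes each) is an assumption.
ceiling = max_tokens_per_sec(546.0, 8.0, 2.0)
print(f"{ceiling:.1f} tokens/sec")  # prints "34.1 tokens/sec"
```

The result is only an upper bound for the assumed peak bandwidth. Because the bandwidth actually available to the ANE is lower and undocumented, a measured figure such as one reported by `anemll-bench` would need to replace the 546 GB/s input.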
This tool is designed to measure the ANE's memory bandwidth and other performance metrics across different Apple Silicon chips, since Apple does not publish these figures in detail. The ANEMLL library provides pre-converted models, including Gemma 3 and LLaMA 3.1/3.2, to accelerate development. It also features an ANE Profiler for analyzing and debugging model performance without needing Xcode, offering a more streamlined workflow for developers.
For enterprise applications in supply chain and logistics, on-device LLMs make it possible to process sensitive data without sending it to the cloud, enhancing security and compliance. This enables real-time, offline capabilities for tasks like summarizing reports, analyzing documents, or powering intelligent assistants on ruggedized handhelds in a warehouse or retail environment.
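The FP16 ceiling of 65,504 mentioned earlier can be demonstrated with Python's standard library alone. This is a minimal sketch, not ANEMLL code: the helper name `fits_in_fp16` is hypothetical, and it relies on the `struct` module's `e` format code, which packs a float as IEEE 754 half precision and raises `OverflowError` when the value is too large.

```python
import struct

FP16_MAX = 65504.0  # largest finite half-precision value


def fits_in_fp16(value: float) -> bool:
    """Return True if value packs into half precision without overflowing."""
    try:
        struct.pack("<e", value)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False


# Sample values are illustrative, not taken from any real model.
for sample in (1234.5, FP16_MAX, 70000.0):
    print(sample, fits_in_fp16(sample))
```

A check like this could be run over sampled values from a model before conversion to flag where scaling or clamping might be needed. How ANEMLL itself handles such cases is not described here.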