Apple M1 benchmarking results scarce
- Apple M1 benchmark data for local language-model inference remained hard to verify on May 22, 2026, with recent social posts skewing toward newer M4 Pro results.
- A llama.cpp Apple Silicon benchmark table lists base M1 decode speeds around 14.15 tokens per second for a 7B-class Q4_0 run.
- Apple’s Core ML Llama example points to M1 Max at about 33 tokens per second; dedicated M1 retests remain pending.

Apple M1 benchmark data for local large-language-model inference is available, but not in the kind of recent, reproducible stream that now surrounds newer Apple chips. Searches over the last two days turned up more references to M4 Pro and M4 Max performance than to the original M1, and the most-circulated social post in that window discussed M4 Pro sustaining roughly 30 to 80 tokens per second for local inference rather than publishing new M1 measurements.

GitHub and Apple documentation still provide useful M1-era reference points. A maintained llama.cpp discussion comparing Apple Silicon systems includes entries for the base M1, M1 Pro, M1 Max and M1 Ultra, while Apple’s own machine-learning research page gives a separate Core ML example for Llama 3.1 on M1 Max. Those sources do not amount to a fresh M1 retest, but they do show why the original chip is still part of the conversation. (github.com)

### Which M1 numbers can actually be verified?

The clearest published M1 figures come from the llama.cpp Apple Silicon benchmark discussion started by maintainer Georgi Gerganov in November 2023. In that table, a base M1 with a 7B-class model in Q4_0 is shown at about 14.15 to 14.19 tokens per second for token generation, while Q8_0 generation is listed near 7.9 tokens per second. (github.com)

An older llama.cpp issue from March 2023 gives another point of reference. That thread lists an M1 running a 7B model at 94.24 milliseconds per token, which works out to roughly 10.6 tokens per second, and a 13B model at 202.18 milliseconds per token, or about 4.9 tokens per second. (github.com)

### Why are M1 results harder to compare than the newer M4 chatter?

Model choice, quantization and software stack all change the answer. (github.com) The llama.cpp table separates prompt processing from token generation and reports different results for F16, Q8_0 and Q4_0, while Apple’s Core ML example uses Llama-3.1-8B-Instruct and reports decoding speed under a different framework. (github.com)

That means an M1 number from llama.cpp is not directly interchangeable with an M1 Max number from Core ML, and neither should be treated as a universal “Apple Silicon speed.” A GitHub repository dedicated to LLM benchmarking on Apple Silicon also shows how users mix MLX, llama.cpp, LM Studio and Ollama when reporting results, which further complicates direct comparisons. (github.com)

### What do Apple’s own materials show?

Apple’s machine-learning research team says a Mac with M1 Max can run Llama-3.1-8B-Instruct locally at about 33 tokens per second decoding speed using Core ML and the optimizations described in its post. The company presents that as an on-device example rather than a broad benchmark suite across all M1-family systems. (machinelearning.apple.com)

Hugging Face, in a separate post on Swift Transformers and Core ML support, also pointed to a video demonstration of a Llama 2 7B chat model running on an M1 MacBook Pro. That post underscores that Apple-device inference is feasible on older hardware, but it does not provide a standardized M1 leaderboard in the way users often want. (huggingface.co)

### So what is missing from the M1 picture right now?

What is missing is a recent, reproducible M1 test set using the same model, quantization, context length and software stack now being used to discuss M4 Pro numbers. The available sources establish that base M1 systems can generate tokens locally and provide historical ranges, but they do not supply a current apples-to-apples comparison against the latest Apple chips. (github.com)

Community projects are trying to fill that gap. A Hugging Face forum post this month described an Apple Silicon benchmarking app called Anubis with public submissions covering M1 through newer chips, including tokens per second and time-to-first-token, but that effort is still community-driven rather than a settled reference standard. (discuss.huggingface.co)

### Where would a cleaner M1 answer likely come from next?

The next useful M1 datapoint will likely come from a controlled retest in a public repo or benchmark harness, not from a stray social post. The most relevant places to watch are the llama.cpp Apple Silicon discussion, Apple’s machine-learning research pages, and community benchmark repositories that publish model names, quantization settings and token-rate methodology alongside the raw numbers. (discuss.huggingface.co) (github.com)
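
As a rough illustration of what publishing settings alongside raw numbers could look like, the Python sketch below converts the millisecond-per-token figures quoted from the March 2023 llama.cpp issue into tokens per second and refuses to compare two results unless model, quantization, framework and context length are all known and identical. The `BenchmarkRecord` schema and the `comparable` helper are hypothetical names invented for this example, not the format used by llama.cpp, Apple or any of the community projects mentioned above. Settings the cited sources do not state are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


def ms_per_token_to_tps(ms_per_token: float) -> float:
    """Convert per-token latency in milliseconds to tokens per second."""
    if ms_per_token <= 0:
        raise ValueError("ms_per_token must be positive")
    return 1000.0 / ms_per_token


@dataclass(frozen=True)
class BenchmarkRecord:
    # Hypothetical schema for illustration only.
    chip: str
    model: str
    framework: str
    decode_tps: float
    quantization: Optional[str] = None    # None = not stated by the source
    context_length: Optional[int] = None  # None = not stated by the source


def comparable(a: BenchmarkRecord, b: BenchmarkRecord) -> bool:
    """True only if every setting that affects speed is known and identical."""
    settings_a = (a.model, a.framework, a.quantization, a.context_length)
    settings_b = (b.model, b.framework, b.quantization, b.context_length)
    if None in settings_a or None in settings_b:
        return False  # an unstated setting means the comparison is not apples-to-apples
    return settings_a == settings_b


# Figures quoted in the article; unstated settings stay as None.
m1_llamacpp = BenchmarkRecord(
    chip="M1", model="7B-class", framework="llama.cpp",
    decode_tps=14.15, quantization="Q4_0",
)
m1max_coreml = BenchmarkRecord(
    chip="M1 Max", model="Llama-3.1-8B-Instruct", framework="Core ML",
    decode_tps=33.0,
)

print(f"{ms_per_token_to_tps(94.24):.1f}")   # 10.6
print(f"{ms_per_token_to_tps(202.18):.1f}")  # 4.9
print(comparable(m1_llamacpp, m1max_coreml))  # False
```

Treating an unstated setting as grounds for refusing the comparison is a deliberately strict design choice: it keeps a partially documented result from being read as a like-for-like benchmark.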