Apple‑Silicon ML tests

Hands-on benchmarks on M3 Ultra and preliminary M5 Max hardware are producing MLX Engine performance data, alongside experiments with quantized builds of the MiniMax M2.7 model. The tests offer practical performance signals for on-device inference with quantized models on Apple silicon. (x.com)

Running a language model on a Mac means loading billions of weights into memory and multiplying them quickly enough to answer in real time. New hands-on tests on Apple silicon are now putting concrete numbers on how far that can go on M3 Ultra desktops and early M5 Max laptops. (github.com) (machinelearning.apple.com)

The underlying software is MLX, Apple’s machine-learning framework for Apple silicon. Apple says MLX uses the chips’ unified memory, so the central processor and graphics processor can work on the same data without copying it back and forth first. (ml-explore.github.io) (machinelearning.apple.com)

Those hardware details matter because Apple’s March 5, 2025 launch of M3 Ultra pushed Mac memory far past typical laptop graphics limits. Apple said M3 Ultra supports 96 gigabytes to 512 gigabytes of unified memory, more than 800 gigabytes per second of memory bandwidth, and a 32-core Neural Engine. (apple.com)

Apple added another piece on November 19, 2025, when its machine-learning research team said the latest macOS beta let MLX tap Neural Accelerators in the M5 graphics processor. Apple said those units handle matrix multiplication, the repeated math operation that dominates large-language-model inference. (machinelearning.apple.com)

The new community benchmarks are early, but they line up with that design: they focus on token generation rates, memory use, model load times, and different quantization settings on Apple silicon. The repository says its goal is to compare MLX with tools including llama.cpp, LM Studio, and Ollama using reproducible runs. (github.com)

Quantization is the compression step in this story. Apple’s MLX team says quantization stores model parameters at lower precision to cut memory use, which is why a model that is too large in full precision can become practical on a single machine after conversion. (machinelearning.apple.com) That is where MiniMax M2.7 enters the picture.
MiniMax describes M2.7 as a 230-billion-parameter text model, and Hugging Face now lists quantized variants built for local runtimes, including MLX and other formats. (minimax.io) (huggingface.co 1) (huggingface.co 2)

One Hugging Face quantization page shows why Apple’s memory ceiling matters: MiniMax M2.7 quantized builds range from roughly 60.7 gigabytes at 1-bit to 243 gigabytes at 8-bit, while the full BF16 version is listed at 457 gigabytes. Those sizes sit squarely in the range where a 128-gigabyte laptop and a 512-gigabyte desktop behave like very different classes of local AI machine. (huggingface.co)

The broader shift is that local-model software is starting to standardize around Apple’s stack. Ollama added MLX support for Apple silicon in late March 2026, and Ars Technica reported that the change was aimed at faster inference on Macs. (arstechnica.com) (appleinsider.com)

The result is not a lab claim from a chip launch slide but a practical test: how large a model fits, how aggressively it can be quantized, and how many tokens per second a Mac can actually sustain. That is the question these M3 Ultra and early M5 Max runs are starting to answer. (github.com) (machinelearning.apple.com)
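
The size figures quoted above can be turned into a rough back-of-the-envelope check. The Python sketch below is illustrative only: it takes the listed build sizes and MiniMax's stated 230 billion parameters from the article, assumes the sizes are decimal gigabytes, and uses a hypothetical `USABLE_FRACTION` of 0.75 as a stand-in for how much unified memory is realistically available for weights. That fraction is an assumption, not a figure from Apple or the benchmark repository.

```python
# Illustrative sketch: inputs are the figures quoted in the article.
# USABLE_FRACTION is a hypothetical headroom assumption, not a measured value.

PARAMS = 230e9  # MiniMax's stated parameter count for M2.7

# Listed build sizes, assumed to be decimal gigabytes (1 GB = 1e9 bytes).
LISTED_SIZES_GB = {"1-bit": 60.7, "8-bit": 243.0, "BF16": 457.0}

# Unified-memory configurations mentioned in the article, in gigabytes.
MACHINES_GB = {"128 GB laptop": 128, "512 GB desktop": 512}

# Assumed share of unified memory that can hold weights (hypothetical).
USABLE_FRACTION = 0.75


def effective_bits_per_weight(size_gb: float, params: float = PARAMS) -> float:
    """Average number of bits stored per parameter for a build of this size."""
    return size_gb * 1e9 * 8 / params


def fits(size_gb: float, machine_gb: float,
         usable_fraction: float = USABLE_FRACTION) -> bool:
    """True if the build fits within the assumed usable share of memory."""
    return size_gb <= machine_gb * usable_fraction


if __name__ == "__main__":
    for name, size in LISTED_SIZES_GB.items():
        bits = effective_bits_per_weight(size)
        verdicts = ", ".join(
            f"{machine}: {'fits' if fits(size, gb) else 'too large'}"
            for machine, gb in MACHINES_GB.items()
        )
        print(f"{name:>5}: {size:6.1f} GB, ~{bits:.1f} bits/weight; {verdicts}")
```

Under these assumptions, the listed sizes work out to roughly 2.1, 8.5, and 15.9 bits per parameter, so a nominal bit width is not the same as the average stored width. One plausible reason, not confirmed by the source, is that some tensors and quantization metadata are kept at higher precision. The fit check also only covers weights; context cache and runtime overhead would add to the total.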

Shared from Scout - Be the smartest in the room.