Bonsai‑8B‑mlx: tiny 1‑bit model for Apple Silicon
An 8-billion-parameter model called Bonsai-8B-mlx has been compressed to a 1-bit format so it can run locally on Apple Silicon, strengthening the case for low-latency on-device intelligence. The project presents model compression and local inference as a way to avoid cloud round trips and the latency they add (x.com).
A language model is mostly a giant table of numbers called weights. An 8-billion-parameter model usually stores those numbers in 16-bit floating point, which is why the weights alone of a typical model this size take about 16.38 gigabytes. Bonsai-8B-mlx stores the weights in a 1-bit MLX format instead, cutting parameter memory to 1.28 gigabytes on Apple Silicon. (huggingface.co)

One bit means each stored weight has only two possible values instead of a full range of decimal values, like replacing a dimmer switch with a simple left-or-right toggle. In Bonsai’s MLX format, each weight is encoded as either the negative or the positive of a floating-point scale that is shared across a group of 128 weights. (huggingface.co)

That trick only works if the model was built to survive extreme compression, because most language models fall apart when they are crushed this hard after training. PrismML says Bonsai uses end-to-end 1-bit weights across the embeddings, attention projections, multilayer perceptron projections, and final language modeling head, rather than leaving big chunks in higher precision. (huggingface.co)

Apple Silicon is the target because Apple’s MLX software stack is built for running machine learning workloads on Mac, iPhone, and iPad chips without sending requests to a cloud server. The Bonsai-8B-mlx release is packaged in MLX format, and the model card says it runs on Mac, iPhone, and iPad. (huggingface.co)

The speed claim is what makes people pay attention, because local models are often private but slow. PrismML says the MLX version is 8.4 times faster than a 16-bit floating-point version on an M4 Pro and reaches 44 tokens per second on iPhone. (huggingface.co)

The catch is that this is not yet a drop-in file for the standard tools most hobbyists already use.
PrismML’s demo repository says the required 1-bit inference kernels are not available in upstream MLX or upstream llama.cpp, so the current release depends on PrismML forks and prebuilt binaries. (github.com)

There is also a second format, GGUF (GPT-Generated Unified Format), which is the file format llama.cpp uses for local inference outside Apple’s MLX stack. The same 8-billion-parameter model is offered there too, and that version shrinks to 1.15 gigabytes because it stores one scale value per group of 128 weights instead of the two per-group values used by the MLX layout. (huggingface.co, huggingface.co)

PrismML is not pitching this as a tiny toy model with tiny-model quality. The company says Bonsai-8B averages 70.5 across six benchmark categories while matching full-precision 8-billion-parameter models at roughly one-fourteenth the size. (huggingface.co, huggingface.co)

The model itself is based on the Qwen3-8B dense architecture, with 36 transformer decoder blocks and a context length of 65,536 tokens, so the compression story sits on top of a familiar modern base model rather than a brand-new architecture. That matters because it turns the launch into a test of whether aggressive compression can ride on mainstream model designs instead of replacing them. (huggingface.co)

PrismML announced the Bonsai family on March 31, 2026, and released three sizes: 8 billion, 4 billion, and 1.7 billion parameters, each in both MLX and GGUF variants. The immediate bet is simple: if a useful model fits in about 1 gigabyte and runs natively on Apple devices, the slowest part of using artificial intelligence stops being the model and starts being the network you no longer need. (prismml.com, github.com)
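
For readers who want to check the arithmetic, here is a rough Python sketch of the sign-plus-shared-scale idea and of the file sizes quoted above. It is an illustration, not PrismML's code. The function names are made up for this example, and the mean-absolute-value scale is an assumed choice. The parameter count of 8.19 billion is inferred from the 16.38 gigabyte figure at 2 bytes per weight. The per-group overhead (two 16-bit values for MLX, one for GGUF) is an assumption that happens to reproduce the reported sizes. It also quantizes numbers after the fact, which is exactly the shortcut the article says does not work for real models that were not trained for it.

```python
import random

GROUP_SIZE = 128  # weights per shared scale, as stated in the article


def quantize_group(weights):
    """Encode one group as sign bits plus a single shared scale."""
    # Assumption: the shared scale is the mean absolute value of the group.
    scale = sum(abs(w) for w in weights) / len(weights)
    bits = [1 if w >= 0 else 0 for w in weights]
    return bits, scale


def dequantize_group(bits, scale):
    """Decode each bit back to +scale or -scale."""
    return [scale if b else -scale for b in bits]


def size_gb(n_params, bits_per_weight):
    """Weight storage in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: 16.38 GB at 2 bytes per weight implies about 8.19e9 parameters.
N_PARAMS = 8.19e9
FP16_BITS = 16
# Assumption: per-group metadata is stored as 16-bit values.
MLX_BITS = 1 + 2 * 16 / GROUP_SIZE   # sign bit + two values per group
GGUF_BITS = 1 + 1 * 16 / GROUP_SIZE  # sign bit + one scale per group

if __name__ == "__main__":
    random.seed(0)
    group = [random.gauss(0.0, 0.02) for _ in range(GROUP_SIZE)]
    bits, scale = quantize_group(group)
    restored = dequantize_group(bits, scale)
    print(f"shared scale: {scale:.4f}, distinct decoded values: {len(set(restored))}")

    for name, bpw in [("FP16", FP16_BITS), ("MLX 1-bit", MLX_BITS), ("GGUF 1-bit", GGUF_BITS)]:
        print(f"{name}: {bpw} bits/weight, {size_gb(N_PARAMS, bpw):.2f} GB")
```

Under these assumptions the three sizes come out to 16.38, 1.28, and 1.15 gigabytes, in line with the figures on the model cards. The difference between the two 1-bit formats is an eighth of a bit per weight, which at this scale adds up to roughly 0.13 gigabytes.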