Tiny models: 'Ternary Bonsai' video

A YouTube video titled 'Ternary Bonsai: The Tiny Model That Should Not Be This Good' showcases a highly compressed, low‑precision model that delivers surprisingly strong edge performance for its size. The presentation argues that low‑bit and ternary approaches can close much of the quality gap while reducing memory and compute pressure on devices. (youtube.com)

A language model is mostly a giant table of numbers, and most of the cost comes from storing and multiplying those numbers. Ternary models cut that table down to three values (negative one, zero, and positive one) to shrink memory and speed up local inference. (arxiv.org)

That idea moved from papers into a product release on April 16, 2026, when PrismML announced Ternary Bonsai in 8 billion, 4 billion, and 1.7 billion parameter sizes. PrismML said the models use 1.58 bits per weight across the whole network, including embeddings, attention, multilayer perceptrons, and the language-model head. (prismml.com)

A YouTube review posted April 18 by creator Fahd Mirza framed the result in practical terms for Apple hardware, calling it a “1.58-bit” model family for Apple Silicon. The video says it is reviewing Ternary Bonsai running locally rather than a cloud-hosted demo. (youtube.com)

PrismML said the 8B version uses about 1.75 gigabytes of memory, roughly one-ninth the footprint of a standard 16-bit model of the same size. The company also released packed MLX versions on Hugging Face for the 8B, 4B, and 1.7B models. (prismml.com) (huggingface.co)

The immediate pitch is edge use: phones, laptops, and other devices that cannot comfortably hold a full-size 8B model in memory. PrismML said Ternary Bonsai 8B can run on iPhones, and outside coverage on April 17 described the release as an on-device model family rather than a server-first one. (prismml.com) (gigazine.net)

This is part of a broader shift in low-bit artificial intelligence, where model designers stop treating compression as an after-the-fact trick. Microsoft’s BitNet project and the 2025 bitnet.cpp paper argued that ternary and near-1-bit models need custom kernels and software stacks to realize their speed and energy gains on edge hardware. (github.com) (arxiv.org)

The trade-off is that tiny weights do not automatically make a strong model. Deepgrove’s earlier Bonsai work showed a 500 million parameter ternary-weight model could be competitive in its small class, but its model card also said current Hugging Face use still ran operations in 16-bit precision and recommended fine-tuning before downstream use. (github.com) (huggingface.co)

PrismML’s claim is stronger: it said Ternary Bonsai keeps the 1.58-bit representation throughout the network, with “no higher-precision escape hatches.” The company also published an unpacked full-precision version for compatibility, while saying the packed format is where the memory, speed, and energy gains actually come from. (prismml.com) (huggingface.co)

The benchmark claims should be read as vendor claims until independent testing catches up. PrismML said Ternary Bonsai outperforms most peers in its parameter class on standard benchmarks, while third-party coverage repeated the company’s chart that placed Ternary Bonsai 8B above some much larger-memory rivals. (prismml.com) (gigazine.net)

What the video captures is the new threshold for “small.” An 8B model that fits in roughly 1.75 gigabytes is no longer small by parameter count, only by the amount of memory and power it asks from the device running it. (youtube.com) (prismml.com)
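The "1.58-bit" label and the footprint comparison can be checked with back-of-envelope arithmetic. The Python sketch below is illustrative only: it assumes a nominal 8 billion parameters, ignores scale factors and file metadata (so it will not reproduce the quoted 1.75 gigabytes exactly), and the base-3 packing helpers `pack5` and `unpack5` are hypothetical names showing one generic way to store ternary values, not PrismML's actual packed format.

```python
import math

# A weight with three possible values carries log2(3) bits of information,
# which is where the "1.58-bit" label comes from.
BITS_PER_TERNARY = math.log2(3)  # about 1.585


def footprint_gb(params: float, bits_per_weight: float) -> float:
    """Weight storage in gigabytes (10^9 bytes), ignoring scales and metadata."""
    return params * bits_per_weight / 8 / 1e9


def pack5(trits):
    """Pack five values from {-1, 0, 1} into one byte using base-3 digits.

    3**5 = 243 fits in a byte, so this costs 8 / 5 = 1.6 bits per weight.
    Illustrative scheme only; not taken from any vendor format.
    """
    assert len(trits) == 5 and all(t in (-1, 0, 1) for t in trits)
    byte = 0
    for t in reversed(trits):
        byte = byte * 3 + (t + 1)  # map -1, 0, 1 to digits 0, 1, 2
    return byte


def unpack5(byte):
    """Inverse of pack5: recover the five ternary values from one byte."""
    trits = []
    for _ in range(5):
        trits.append(byte % 3 - 1)
        byte //= 3
    return trits


if __name__ == "__main__":
    params = 8e9  # assumed nominal parameter count for an "8B" model
    print(f"bits per ternary weight: {BITS_PER_TERNARY:.3f}")
    print(f"16-bit weights:   {footprint_gb(params, 16):.2f} GB")
    print(f"1.58-bit weights: {footprint_gb(params, 1.58):.2f} GB")
    print(f"16 GB / 1.75 GB:  {footprint_gb(params, 16) / 1.75:.1f}x")
```

Under these assumptions, 16-bit weights come to 16 gigabytes and 1.58-bit weights to about 1.58 gigabytes, so a reported footprint of about 1.75 gigabytes is consistent with the "roughly one-ninth" comparison.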
