DIY cluster ran a 685B model

A DIY build of 32 Intel N100 mini PCs, paired with an RTX 5090, reportedly ran DeepSeek‑V3.2 (a 685‑billion‑parameter model) at roughly 16 tokens/sec decode with an 89% acceptance rate, using Ethernet and Mixture‑of‑Experts routing across CPUs. The builder posted hardware details and performance numbers as a proof of concept for low‑cost distributed inference (x.com).

A language model is a text engine that predicts the next word, and the biggest ones usually need server-class hardware to run. One hobbyist said a home build pushed that work onto 32 low-power mini computers and one consumer graphics card. (huggingface.co) (x.com)

The builder, posting on X as squirrel__cute, said the cluster ran DeepSeek-V3.2, an open model family DeepSeek describes as tuned for reasoning and tool use. The post listed 32 Intel Processor N100 mini PCs, Ethernet networking, and an NVIDIA GeForce RTX 5090. (x.com) (huggingface.co) (nvidia.com)

The performance claim was about 16 tokens per second in decode, with an 89 percent acceptance rate. In plain terms, the system was generating about 16 chunks of text each second while keeping most of its speculative guesses instead of throwing them away. (x.com) (github.com)

Mixture-of-Experts is the design choice behind that split. Instead of waking up the full model for every token, the system routes each token to a smaller subset of specialists, and DeepSeek’s earlier V3 paper says 37 billion of 671 billion parameters are activated for each token. (github.com)

That matters for home-built inference because memory, not just raw compute, is the wall. NVIDIA says the GeForce RTX 5090 has 32 gigabytes of GDDR7 memory, far short of what a dense model with hundreds of billions of parameters would need in one place. (nvidia.com)

The Intel side of the build is also modest by data-center standards. Intel’s N-series brief says these chips are aimed at entry-level systems and can be configured with as few as four efficient cores, which helps explain why the project leaned on many small boxes instead of one large server. (intel.com)

DeepSeek’s published work has emphasized the same basic constraint from the other direction: moving data between machines can become the bottleneck. Its V3 repository says the company worked to reduce cross-node communication overhead in training, and the hobbyist build is effectively testing a cheaper version of that idea for inference over ordinary Ethernet. (github.com)

The post does not amount to an independent benchmark, and the reported numbers come from the builder rather than a lab or vendor test. But the hardware list, routing approach, and speed figures were detailed enough to turn the project into a concrete proof of concept for distributed inference outside a data center. (x.com)

The result was not that a bargain desktop suddenly became a supercomputer. It was that a pile of low-cost machines, wired together carefully, appeared to keep a very large open model talking at a usable speed. (x.com)
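
For readers who want to check the memory argument, the arithmetic can be sketched in a few lines of Python. This is a rough illustration, not the builder's method: the 685 billion, 37 billion, 671 billion, 32-gigabyte, and 32-machine figures come from the sources cited above, while the bytes-per-parameter values and the even split across machines are assumptions made here, since the post as summarized does not state the precision or layout used. The helper names are hypothetical.

```python
# Back-of-envelope sizing for the build described above.
# Values marked "assumed" are illustrative and do not come from the post.

TOTAL_PARAMS = 685e9   # DeepSeek-V3.2 size as reported in the article
V3_TOTAL = 671e9       # DeepSeek-V3 paper: total parameters
V3_ACTIVE = 37e9       # DeepSeek-V3 paper: parameters activated per token
GPU_MEMORY_GB = 32     # GeForce RTX 5090 memory, per NVIDIA
NODES = 32             # N100 mini PCs listed in the post


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Weight storage in decimal gigabytes (ignores KV cache and activations)."""
    return params * bytes_per_param / 1e9


def active_fraction(active: float, total: float) -> float:
    """Share of parameters a Mixture-of-Experts model touches per token."""
    return active / total


if __name__ == "__main__":
    # Assumed storage precisions; the actual format used is not stated.
    for label, width in (("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)):
        total = weight_gb(TOTAL_PARAMS, width)
        print(
            f"{label}: {total:.1f} GB of weights, "
            f"{total / GPU_MEMORY_GB:.1f}x the GPU's memory, "
            # An even split across machines is an assumption, not the builder's layout.
            f"{total / NODES:.1f} GB per machine if split evenly"
        )
    share = active_fraction(V3_ACTIVE, V3_TOTAL)
    print(f"Active share per token (V3 figures): {share:.1%}")
```

Under any of these assumed precisions the weights alone exceed the single graphics card's 32 gigabytes many times over, while the V3 figures suggest only a few percent of the parameters are active for a given token. That gap is what makes spreading the specialists across many small machines plausible in principle.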
