Pods: laptop sharding

Hyperspace introduced 'Pods,' a peer‑to‑peer system that pools developer laptops to shard large models like Qwen 3.5 32B, enabling free local inference with a cloud fallback. The post pitches this as a low‑cost option for startups to run large models without relying solely on cloud GPUs. (x.com)

Running a big language model on a laptop usually fails on memory, so Hyperspace built a way to split one model across several laptops. (github.com) Hyperspace calls the system “Pods,” and its GitHub documentation says it pools machines into one cluster for “distributed and sharded inference” with an OpenAI-compatible application programming interface. The quick-start example shows a user creating a pod, inviting another member, and sharding `qwen3.5:32b` across the group. (github.com)

In plain terms, sharding means one laptop holds some layers of the model and another laptop holds the rest, passing intermediate data between them while generating each token. Hyperspace’s example assigns layers 0 through 31 to one 16-gigabyte machine and layers 32 through 63 to a second 16-gigabyte machine. (github.com)

That matters because Qwen3-32B is a 32.8-billion-parameter model with 64 layers, which is large enough that local use often pushes beyond a single consumer machine’s comfortable memory budget. Qwen’s official Hugging Face model card lists quantized builds such as `q4_K_M` and `q8_0`, underscoring that developers already compress the model to make it fit on smaller hardware. (huggingface.co)

Hyperspace’s documentation recommends 32 gigabytes of combined video memory for a sharded Qwen 3.5 32B setup, framed as two 16-gigabyte machines working together. The same table maps 64 gigabytes to larger 70-billion-parameter-class models in quantized form. (github.com)

The company says the software surveys each node’s free memory, estimates model size from the name and quantization, splits layers in proportion to available memory, and has each node download only its assigned slice. The transport layer uses libp2p protocols for activation streams, request routing, and token streaming between devices. (github.com)

Pods also do not stop at local hardware. Hyperspace says requests are routed in priority order from a pod-distributed local shard to a federated peer pod, then to bring-your-own-key cloud providers, and finally to platform-funded cloud inference that charges a shared pod treasury. (github.com)

That hybrid design lines up with the company’s broader pitch. Hyperspace’s GitHub profile describes the project as a peer-to-peer agent network, and its `hyperspace-node` repository says the inference network runs on libp2p with direct peer connections rather than central servers. (github.com) (github.com)

The tradeoff is speed and reliability. Splitting a model across home laptops adds network hops for every token, so Pods are aimed less at beating a datacenter graphics processing unit and more at making larger open models usable on hardware a small team already owns. That is the gap Hyperspace is trying to fill with a local-first system that falls back to the cloud when the pod runs out of room. (github.com)
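
To make the two mechanisms above concrete, here is a minimal Python sketch of a memory-proportional layer split and a priority-ordered fallback router. It is an illustration under stated assumptions, not Hyperspace's code: `Node`, `split_layers`, `route`, and the tier names in `ROUTE_ORDER` are hypothetical and do not come from the project's documentation.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str        # hypothetical node label
    free_gb: float   # free memory the node reports (hypothetical field)


def split_layers(nodes, total_layers):
    """Assign contiguous, inclusive (first, last) layer ranges
    in proportion to each node's free memory."""
    total_mem = sum(n.free_gb for n in nodes)
    if total_mem <= 0:
        raise ValueError("pod has no free memory")
    plan, start, cumulative = {}, 0, 0.0
    for i, node in enumerate(nodes):
        cumulative += node.free_gb
        if i == len(nodes) - 1:
            end = total_layers  # last node takes the remainder
        else:
            end = round(total_layers * cumulative / total_mem)
        plan[node.name] = (start, end - 1)
        start = end
    return plan


# Hypothetical tier names mirroring the priority order described in the article.
ROUTE_ORDER = ["pod_local_shard", "federated_peer_pod", "byok_cloud", "platform_cloud"]


def route(request, backends):
    """Try each tier in priority order and return (tier, result)
    from the first one that does not raise RuntimeError."""
    for tier in ROUTE_ORDER:
        handler = backends.get(tier)
        if handler is None:
            continue  # tier not configured for this pod
        try:
            return tier, handler(request)
        except RuntimeError:
            continue  # tier unavailable, fall through to the next one
    raise RuntimeError("no backend could serve the request")


if __name__ == "__main__":
    pod = [Node("laptop-a", 16), Node("laptop-b", 16)]
    print(split_layers(pod, 64))
    # {'laptop-a': (0, 31), 'laptop-b': (32, 63)}
```

With two 16-gigabyte nodes and 64 layers, the split reproduces the 0 through 31 and 32 through 63 assignment from Hyperspace's example. A real planner would also have to account for per-layer size differences, embeddings, and cache memory, which this sketch ignores.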
