NVIDIA posts inference wins

NVIDIA shared several inference optimisations: Distributed Weight Data Parallelism (DWDP) for MoE LLMs that improves GPU TPS under imbalanced traffic, FlexTensor host‑memory offload to run very large FP8 models on smaller GPUs, and multi-node FP8 scaling for Qwen‑style MoE models on DGX Sparks. These techniques aim to raise throughput or reduce hardware needs for agentic workloads by spreading expert weights, transparently extending GPU memory, and using tensor/unified-memory tricks. (x.com)

Serving a large language model is mostly a memory problem before it is a math problem. A single reply can touch billions of weights, and if those weights do not fit where the graphics processor can reach them quickly, token generation slows down or the model will not run at all. (docs.nvidia.com)

One reason this gets messy is mixture-of-experts design. In a mixture-of-experts model, a router picks a few specialist sub-networks for each token instead of waking up the whole model, which saves compute but creates traffic jams when too many requests want the same specialists at once. (developer.nvidia.com) Most multi-graphics-processor setups handle that by splitting the model and forcing the chips to wait for each other layer by layer. The new Distributed Weight Data Parallelism method instead keeps data-parallel execution and stores some expert weights on peer graphics processors, then fetches missing experts on demand so each chip can keep moving independently. (arxiv.org)

The paper behind Distributed Weight Data Parallelism says the point is to stop one overloaded chip from stalling the whole group under imbalanced traffic. It also adds split-weight management and asynchronous remote-weight prefetch, which means the system starts pulling the next needed expert before the request fully arrives there. (arxiv.org)

The second bottleneck is raw memory size. Floating-point 8, which stores numbers in 8 bits instead of larger formats, cuts memory use and usually raises throughput, but even floating-point 8 models can still be too large for a single workstation-class device. (developer.nvidia.com)

That is where host-memory offload comes in. NVIDIA has been pushing unified memory on Grace systems so tensors can spill from graphics memory into central processor memory, and its Grace Hopper guidance describes this as a way to run larger models or larger batches without manually juggling every allocation. (developer.nvidia.com) The FlexTensor idea is the same trick aimed at inference: keep the hot parts on the graphics processor, park colder weights in host memory, and move them transparently when needed. You give up some speed versus having every byte on-device, but you can run a model that would otherwise be too big for the box sitting on your desk. (developer.nvidia.com)

The third piece is scaling small desktop systems into a cluster. NVIDIA’s DGX Spark line is built around 128 gigabytes of unified memory per node, and NVIDIA’s own multi-node setup guide says chaining nodes together lets developers run models beyond a single system’s memory ceiling. (developer.nvidia.com, deepwiki.com)

NVIDIA has already shown that precision changes can move the needle hard on this hardware. In a January 2026 DGX Spark post, the company said Qwen-235B running in NVIDIA’s 4-bit floating-point format with speculative decoding delivered up to 2.6 times the performance of floating-point 8 on the same dual-DGX Spark setup, partly because floating-point 8 saturated the pair’s combined memory. (developer.nvidia.com)

Put together, these inference wins attack three different choke points. Distributed Weight Data Parallelism deals with uneven expert demand, FlexTensor-style offload deals with models that do not fit, and multi-node floating-point 8 scaling on DGX Spark deals with spreading one model across several smaller machines instead of buying one much larger server. (arxiv.org, developer.nvidia.com, deepwiki.com)

That matters for agentic workloads because agents do not send one neat prompt and stop. They loop through tools, branch into many small requests, and create exactly the kind of bursty, uneven inference traffic that punishes synchronized multi-graphics-processor systems and rewards anything that can squeeze more tokens per second out of the same hardware. (docs.nvidia.com, arxiv.org)
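
The memory ceiling behind the dual-DGX Spark comparison can be checked with back-of-envelope arithmetic. The sketch below is illustrative only and rests on several assumptions that are not NVIDIA figures: the parameter count is read as 235 billion from the model name, floating-point 8 is taken as one byte per weight and the 4-bit format as half a byte, gigabytes are decimal, and the function names are hypothetical. It counts weights alone, so the key-value cache, activations, scaling factors, and the operating system's share of unified memory would all push the real footprint higher.

```python
# Back-of-envelope weight-memory estimate (a sketch, not a measurement).
# Assumptions: weights only, decimal gigabytes, no KV cache, no activations,
# no per-block scaling factors, and all of unified memory available.

GB = 1e9  # bytes per decimal gigabyte


def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate storage for the model weights, in gigabytes."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / GB


def headroom_gb(params_billion: float, bits_per_weight: int,
                nodes: int, gb_per_node: float = 128.0) -> float:
    """Memory left across the cluster after the weights are loaded.

    A negative result means the weights alone do not fit.
    """
    return nodes * gb_per_node - weight_gb(params_billion, bits_per_weight)


if __name__ == "__main__":
    params = 235  # billions of parameters, assumed from the name "Qwen-235B"
    for nodes in (1, 2):
        for bits in (8, 4):
            need = weight_gb(params, bits)
            spare = headroom_gb(params, bits, nodes)
            verdict = "fits" if spare >= 0 else "does not fit"
            print(f"{nodes} node(s), {bits}-bit: weights {need:.1f} GB, "
                  f"headroom {spare:.1f} GB ({verdict})")
```

Under these assumptions the 8-bit weights take about 235 GB of the pair's 256 GB, leaving roughly 21 GB for everything else, while the 4-bit weights take about 117.5 GB. That gap is consistent with the article's point that floating-point 8 saturates the two-node setup, though it is not a substitute for measured numbers.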

Shared from Scout - Be the smartest in the room.