NVIDIA’s LPX racks reshape cluster design

GTC coverage dug into NVIDIA’s LPX rack systems, which combine LPUs, GPUs, CPUs and a high‑speed fabric, signaling that clusters will need topology‑aware orchestration and co‑optimization across compute and networking. For platform engineers, that makes rack‑level scheduling and topology‑aware placement first‑order concerns. (theregister.com)
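To make "rack‑level scheduling" concrete, here is a minimal sketch of gang placement constrained to a single rack. It is illustrative only: the `Rack` type, the `free_slots` field and the `place_gang` function are hypothetical names, not part of any NVIDIA or Kubernetes API, and the best‑fit rule is just one plausible policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rack:
    name: str
    free_slots: int  # hypothetical count of free accelerator slots in this rack


def place_gang(racks: List[Rack], gang_size: int) -> Optional[str]:
    """Place a whole gang inside one rack, or refuse to place it at all.

    Assumption: a gang must never span racks, so that all of its traffic
    stays on the rack-local fabric. Among racks that fit, pick the one
    left with the fewest spare slots (best fit), breaking ties by name.
    """
    candidates = [r for r in racks if r.free_slots >= gang_size]
    if not candidates:
        return None  # no single rack can host the gang; do not split it
    best = min(candidates, key=lambda r: (r.free_slots - gang_size, r.name))
    best.free_slots -= gang_size  # reserve the slots
    return best.name


if __name__ == "__main__":
    racks = [Rack("rack-a", 8), Rack("rack-b", 4)]
    print(place_gang(racks, 4))  # rack-b: exact fit
    print(place_gang(racks, 6))  # rack-a: only rack with room
    print(place_gang(racks, 4))  # None: 2 slots remain, in rack-a only
```

The design choice worth noting is the refusal to split: a scheduler that treats the rack as a locality domain would rather queue a job than scatter it across racks.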

NVIDIA’s Groq 3 LPX rack is built around 256 Groq 3 LPUs, delivering roughly 128 GB of total SRAM per rack and about 40 PB/s of on‑chip SRAM bandwidth, with a dedicated high‑radix chip‑to‑chip fabric rated at 640 TB/s per rack. (tomshardware.com) Each Groq 3 LPU contains about 500 MB of SRAM and advertises ~150 TB/s of local memory bandwidth per chip, while Rubin GPUs use 288 GB of HBM4 at ~22 TB/s. The LPUs therefore form a high‑bandwidth but low‑capacity decode tier distinct from GPU HBM. (tomshardware.com)

NVIDIA describes LPX execution as deterministic and compiler‑orchestrated with explicit data movement, and it says its Dynamo runtime will classify requests and orchestrate per‑token disaggregated serving that routes prefill/attention work to Rubin GPUs and latency‑sensitive decode/FFN work to LPUs. (developer.nvidia.com) LPX is presented as a co‑designed component of the Vera Rubin NVL72 platform alongside Vera CPUs, ConnectX‑9 SuperNICs, BlueField‑4 DPUs and Spectrum‑X/Spectrum‑6 switching, and NVIDIA indicated LPX and the broader Rubin stack will begin shipping in the second half of 2026. (investor.nvidia.com)

Given the LPX rack’s 640 TB/s fabric and explicit per‑token offload model, the hardware establishes rack‑level locality domains that schedulers will need to respect. That is an inference, supported by NVIDIA’s Dynamo design and LPX fabric specifications. (developer.nvidia.com) Existing topology‑aware schedulers and frameworks already expose rack and network locality primitives (for example, unified topology APIs, bin‑pack plugins and network topology CRDs), and those same primitives map directly to the placement and gang‑scheduling needs LPX+Rubin deployments will create. (run-ai-docs.nvidia.com)

NVIDIA claims the LPX+Rubin combination can yield up to ~35× higher inference throughput per megawatt and explicitly targets interactive, agentic workloads with token rates approaching 1,000 tokens per second per user. (developer.nvidia.com)
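The per‑rack figures can be sanity‑checked from the per‑chip numbers quoted above. The short calculation below is a rough sketch: it assumes decimal units (1 GB = 1,000 MB, 1 PB/s = 1,000 TB/s) and treats the approximate figures as exact. The `sweeps_per_second` helper is a hypothetical name for the bandwidth‑to‑capacity ratio, not a metric NVIDIA publishes.

```python
# Figures as reported in the coverage above; all are approximate.
LPUS_PER_RACK = 256
SRAM_PER_LPU_MB = 500   # "about 500 MB"
BW_PER_LPU_TBPS = 150   # "~150 TB/s"
HBM_PER_GPU_GB = 288
HBM_BW_TBPS = 22        # "~22 TB/s"

# Rack aggregates, assuming decimal units.
rack_sram_gb = LPUS_PER_RACK * SRAM_PER_LPU_MB / 1000
rack_bw_pbps = LPUS_PER_RACK * BW_PER_LPU_TBPS / 1000


def sweeps_per_second(bw_tbps: float, capacity_gb: float) -> float:
    """How many times per second a device could read its entire memory."""
    return bw_tbps * 1000 / capacity_gb


lpu_sweeps = sweeps_per_second(BW_PER_LPU_TBPS, SRAM_PER_LPU_MB / 1000)
gpu_sweeps = sweeps_per_second(HBM_BW_TBPS, HBM_PER_GPU_GB)

if __name__ == "__main__":
    print(f"Rack SRAM: {rack_sram_gb:.0f} GB")           # 128 GB
    print(f"Rack SRAM bandwidth: {rack_bw_pbps:.1f} PB/s")  # 38.4 PB/s
    print(f"LPU sweeps/s: {lpu_sweeps:,.0f}")            # 300,000
    print(f"GPU sweeps/s: {gpu_sweeps:.1f}")             # 76.4
```

Under these assumptions the aggregates come out at 128 GB and 38.4 PB/s, consistent with the "roughly 128 GB" and "about 40 PB/s" quoted for the rack. The ratio makes the tiering visible: an LPU can sweep its SRAM about 300,000 times per second, against roughly 76 sweeps per second for a Rubin GPU's HBM4.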

