Groq LPUs join Rubin for speedups
Groq LPUs have been integrated into NVIDIA's Rubin/Vera inference story to target decode and latency‑sensitive layers, with developers reporting gains of up to roughly 10% on some workloads. The result is more hybrid LPU+GPU architectures showing up in inference stacks rather than pure‑GPU deployments.
The Groq 3 LPX rack is built around 256 Groq 3 LPU accelerators. (developer.nvidia.com) Each Groq 3 LPU packs roughly 500 MB of on‑die SRAM and delivers about 150 TB/s of internal bandwidth, versus a Rubin GPU’s 288 GB of HBM4 at ~22 TB/s. (tomshardware.com) At rack scale, LPX aggregates ~128 GB of SRAM with roughly 40 PB/s of SRAM bandwidth and a 640 TB/s per‑rack scale‑up interconnect. (tomshardware.com) NVIDIA positions LPX to accelerate latency‑sensitive decode work, explicitly FFN layers and MoE expert execution, while Rubin GPUs continue to handle prefill and decode attention. (developer.nvidia.com) NVIDIA’s own materials claim the Rubin+LPX heterogeneous path can deliver up to 35× higher inference throughput per megawatt and as much as 10× more revenue opportunity for trillion‑parameter models. (developer.nvidia.com) NVIDIA says the Groq 3 LPX rack will ship alongside the Vera Rubin NVL72 family in H2 2026, following the licensing and team integration that stem from the Groq deal. (crn.com)
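As a quick sanity check, the rack-level figures follow from the per-chip numbers quoted above. The short Python sketch below redoes that arithmetic with the approximate values as reported, assuming decimal units (1 PB/s = 1000 TB/s). The constant and function names are illustrative only and do not come from NVIDIA or Groq materials.

```python
# Approximate figures as quoted in the text; all names here are illustrative.
LPUS_PER_RACK = 256
SRAM_PER_LPU_GB = 0.5         # ~500 MB of on-die SRAM per Groq 3 LPU
SRAM_BW_PER_LPU_TBPS = 150.0  # ~150 TB/s internal bandwidth per LPU
HBM_PER_GPU_GB = 288.0        # HBM4 capacity per Rubin GPU
HBM_BW_PER_GPU_TBPS = 22.0    # ~22 TB/s HBM4 bandwidth per Rubin GPU


def rack_totals(lpus: int, sram_gb: float, bw_tbps: float) -> tuple[float, float]:
    """Return (total SRAM in GB, total SRAM bandwidth in PB/s) for one rack."""
    total_sram_gb = lpus * sram_gb
    total_bw_pbps = lpus * bw_tbps / 1000.0  # assumes 1 PB/s = 1000 TB/s
    return total_sram_gb, total_bw_pbps


rack_sram_gb, rack_bw_pbps = rack_totals(
    LPUS_PER_RACK, SRAM_PER_LPU_GB, SRAM_BW_PER_LPU_TBPS
)

# Per-chip comparison: the LPU trades capacity for bandwidth.
bw_ratio = SRAM_BW_PER_LPU_TBPS / HBM_BW_PER_GPU_TBPS  # LPU vs GPU bandwidth
capacity_ratio = HBM_PER_GPU_GB / SRAM_PER_LPU_GB      # GPU vs LPU capacity

print(f"Rack SRAM: {rack_sram_gb:.1f} GB")              # 128.0 GB
print(f"Rack SRAM bandwidth: {rack_bw_pbps:.1f} PB/s")  # 38.4 PB/s
print(f"LPU/GPU bandwidth ratio: {bw_ratio:.1f}x")
print(f"GPU/LPU capacity ratio: {capacity_ratio:.0f}x")
```

The product 256 × 150 TB/s works out to 38.4 PB/s, in line with the "roughly 40 PB/s" quoted for the rack, and 256 × 0.5 GB gives the ~128 GB SRAM total. The per-chip ratios (about 6.8× the bandwidth at 1/576 of the capacity) fit the split described above, with bandwidth-hungry decode FFN and MoE work on LPX and capacity-heavy prefill and attention on the GPUs.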