Groq LPUs join Rubin for speedups
What happened
Groq LPUs have been integrated into NVIDIA's Rubin/Vera inference story to target latency‑sensitive decode layers, with developers reporting gains of up to ~10% on some workloads. The result is more hybrid LPU+GPU architectures showing up in inference stacks rather than pure‑GPU deployments.
Why it matters
The Groq 3 LPX rack is built around 256 Groq 3 LPU accelerators. (developer.nvidia.com) Each Groq 3 LPU packs roughly 500 MB of on‑die SRAM and delivers about 150 TB/s of internal bandwidth, versus a Rubin GPU’s 288 GB of HBM4 at ~22 TB/s. (tomshardware.com) At rack scale, LPX aggregates ~128 GB of SRAM with roughly 40 PB/s of SRAM bandwidth and a 640 TB/s per‑rack scale‑up interconnect. (tomshardware.com) NVIDIA positions LPX to accelerate latency‑sensitive decode work, specifically FFN layers and MoE expert execution, while Rubin GPUs continue to handle prefill and decode attention. (developer.nvidia.com) NVIDIA’s own materials claim the Rubin+LPX heterogeneous path can deliver up to 35× higher inference throughput per megawatt and as much as 10× more revenue opportunity for trillion‑parameter models. (developer.nvidia.com) NVIDIA says the Groq 3 LPX rack will ship alongside the Vera Rubin NVL72 family in H2 2026, following the licensing and team integration that stemmed from the Groq deal. (crn.com)
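To make the division of labor concrete, here is a minimal Python sketch of the placement rule as described above: decode‑phase FFN and MoE expert work goes to LPX, and everything else stays on the Rubin GPUs. The names `Phase`, `Op`, `place`, and the device labels are hypothetical and chosen for illustration only; this is not an NVIDIA or Groq API, and a real scheduler would weigh far more than two fields.

```python
from enum import Enum


class Phase(Enum):
    PREFILL = "prefill"
    DECODE = "decode"


class Op(Enum):
    ATTENTION = "attention"
    FFN = "ffn"
    MOE_EXPERT = "moe_expert"


def place(phase: Phase, op: Op) -> str:
    """Return a hypothetical device label for one operation.

    Assumption: the split follows the positioning described in the
    article, with no load balancing or fallback logic.
    """
    # Decode-phase FFN layers and MoE expert execution go to the LPX rack.
    if phase is Phase.DECODE and op in (Op.FFN, Op.MOE_EXPERT):
        return "lpx"
    # Prefill (all ops) and decode attention stay on the Rubin GPUs.
    return "rubin_gpu"


if __name__ == "__main__":
    for phase in Phase:
        for op in Op:
            print(f"{phase.value:8s} {op.value:11s} -> {place(phase, op)}")
```

The point of the sketch is only that the split is per operation within the decode step, not per request: a single token's decode pass would touch both device classes under this description.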
Key numbers
- Developers report gains of up to ~10% on some workloads from the Groq LPU integration into NVIDIA's Rubin/Vera inference stack.
- The Groq 3 LPX rack is built around 256 Groq 3 LPU accelerators. (developer.nvidia.com)
- Each Groq 3 LPU packs roughly 500 MB of on‑die SRAM and delivers about 150 TB/s of internal bandwidth, versus a Rubin GPU’s 288 GB of HBM4 at ~22 TB/s. (tomshardware.com)
- At rack scale, LPX aggregates ~128 GB of SRAM with roughly 40 PB/s of SRAM bandwidth and a 640 TB/s per‑rack scale‑up interconnect. (tomshardware.com)
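The rack‑level figures follow from the per‑chip ones, and a few lines of Python are enough to check the arithmetic. This sketch assumes decimal units (1 GB = 1000 MB, 1 PB/s = 1000 TB/s) and uses only the numbers quoted above; the variable names are illustrative.

```python
# Per-chip and per-rack figures as quoted in the article.
LPUS_PER_RACK = 256
SRAM_PER_LPU_MB = 500      # on-die SRAM per Groq 3 LPU
BW_PER_LPU_TBPS = 150      # internal bandwidth per LPU
GPU_HBM_GB = 288           # HBM4 capacity per Rubin GPU
GPU_HBM_BW_TBPS = 22       # HBM4 bandwidth per Rubin GPU

# Rack aggregates (assumption: decimal unit prefixes).
rack_sram_gb = LPUS_PER_RACK * SRAM_PER_LPU_MB / 1000
rack_bw_pbps = LPUS_PER_RACK * BW_PER_LPU_TBPS / 1000

# Per-chip comparison: LPU bandwidth advantage vs. GPU capacity advantage.
bw_ratio = BW_PER_LPU_TBPS / GPU_HBM_BW_TBPS
capacity_ratio = GPU_HBM_GB * 1000 / SRAM_PER_LPU_MB

print(f"Rack SRAM:      {rack_sram_gb:.0f} GB")
print(f"Rack bandwidth: {rack_bw_pbps:.1f} PB/s")
print(f"LPU vs GPU bandwidth: {bw_ratio:.1f}x")
print(f"GPU vs LPU capacity:  {capacity_ratio:.0f}x")
```

The product comes to 128 GB and 38.4 PB/s, consistent with the quoted ~128 GB and "roughly 40 PB/s". The per‑chip ratios show the trade‑off behind the hybrid design: each LPU has several times the memory bandwidth of a Rubin GPU but a small fraction of its capacity.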
What happens next
- NVIDIA says the Groq 3 LPX rack will ship alongside the Vera Rubin NVL72 family in H2 2026, following the licensing and team integration that stemmed from the Groq deal. (crn.com)