NVIDIA Preps New Groq-Inspired AI Chip
NVIDIA is reportedly launching a new inference chip at its GTC 2026 conference that incorporates design elements from Groq's low-latency LPU. The chip is said to use high SRAM integration and 3D stacking to target the memory bottlenecks in large model inference, a key challenge for real-time aerospace systems. This is part of a broader strategy to redefine AI infrastructure with new interconnects and co-packaged optics for distributed inference.
Groq's architecture achieves its speed by keeping data in on-chip Static Random Access Memory (SRAM) rather than the High-Bandwidth Memory (HBM) used in traditional GPUs. This design provides an internal memory bandwidth of approximately 80 TB/s, roughly ten times the 8 TB/s or so offered by off-chip HBM solutions. The result is a substantial reduction in memory access latency, a critical factor for real-time inference tasks. The trade-off for this speed is memory capacity, as SRAM is less dense than DRAM. Consequently, running large models like Llama 2 70B can require coordinating hundreds of Groq LPU chips. To manage this, Groq employs a specialized plesiosynchronous protocol that enables hundreds of LPUs to function as a single core, ensuring deterministic execution where the timing of data arrival is precisely predictable.
For aerospace applications, the determinism and low latency offered by such an architecture are particularly compelling. Aerospace systems often operate under strict real-time constraints where unpredictable delays, or "jitter," are unacceptable. The ability to process sensor data and execute control loops with consistent timing is crucial for safety and reliability in flight control and autonomous navigation systems.
NVIDIA's reported adoption of these design principles, specifically the increased use of SRAM and 3D stacking, aims to address the memory bottlenecks inherent in large model inference. 3D stacking integrates chip layers vertically, shortening the physical distance data must travel and thereby improving performance and power efficiency. The technique is already used to build HBM, which stacks DRAM dies on top of one another. This strategic shift occurs as inference workloads increasingly dominate AI compute demand.
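The capacity and bandwidth trade-off described above can be made concrete with some back-of-envelope arithmetic. The sketch below is illustrative only. It assumes 230 MB of SRAM per chip (the figure Groq has published for its first-generation LPU, not a number from this report) and 16-bit or 8-bit weights. It treats the 80 TB/s and 8 TB/s figures as the bandwidth available for streaming the full weight set once per generated token. The function names are hypothetical, and the estimate ignores sharding, interconnect overhead, compute time and the KV cache.

```python
import math


def chips_needed(params_billion: float, bytes_per_param: float,
                 sram_mb_per_chip: float) -> int:
    """Minimum number of chips needed to hold all weights in on-chip SRAM."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    sram_bytes = sram_mb_per_chip * 1e6
    return math.ceil(weight_bytes / sram_bytes)


def min_ms_per_token(params_billion: float, bytes_per_param: float,
                     bandwidth_tb_s: float) -> float:
    """Lower bound on decode time per token (ms) if every weight is read once."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return weight_bytes / (bandwidth_tb_s * 1e12) * 1e3


if __name__ == "__main__":
    # Assumed values: 70B parameters, 230 MB SRAM per chip (not from the article).
    for label, width in (("16-bit", 2), ("8-bit", 1)):
        print(label, "weights:", chips_needed(70, width, 230), "chips")
    for label, bw in (("HBM, ~8 TB/s", 8), ("SRAM, ~80 TB/s", 80)):
        print(label, "->", round(min_ms_per_token(70, 2, bw), 2), "ms/token floor")
```

Under these assumptions a 70B-parameter model needs roughly 300 to 600 chips just to hold its weights, which is consistent with the "hundreds of LPUs" figure, and the tenfold bandwidth gap translates directly into a tenfold lower floor on per-token latency.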
While NVIDIA's GPUs, supported by the extensive CUDA software ecosystem, have been the standard for model training, inference presents a different set of engineering challenges centered on latency and efficiency. OpenAI has reportedly committed to a $30 billion purchase and investment related to this new NVIDIA technology, signaling strong market interest.
NVIDIA is also advancing its broader AI infrastructure by introducing co-packaged optics (CPO) in its Spectrum-X and Quantum-X networking platforms. By integrating silicon photonics directly with the switch ASIC, NVIDIA aims to improve power efficiency by up to 5x and resiliency by 10x compared to traditional pluggable transceivers. The Spectrum-X Ethernet switches, expected in the second half of 2026, are designed to support massive AI data centers with million-GPU clusters.