NVIDIA Nearly Doubles GPT Model Speed
NVIDIA, in collaboration with OpenAI, nearly doubled the output speed of the gpt-oss-120b model on its hardware. The benchmark highlights ongoing software and hardware optimization gains for large language models running on GPGPU platforms.
- The gpt-oss-120b model uses a Mixture-of-Experts (MoE) architecture, which keeps only a fraction of the model's total parameters active for any given input. The model has 117 billion total parameters but activates only 5.1 billion for each token, significantly reducing the computational load compared to a dense model of similar size.
- Efficiency is further enhanced through native MXFP4 quantization, a 4-bit floating-point representation. This quantization scheme reduces the model's memory footprint, allowing gpt-oss-120b to run on a single NVIDIA H100 GPU with 80GB of memory.
- The performance gains were benchmarked on NVIDIA's Blackwell architecture, with optimizations enabling a single GB200 NVL72 rack-scale system to achieve an inference rate of 1.5 million tokens per second. A smaller companion model, gpt-oss-20b, is designed to run on edge devices with as little as 16GB of memory.
- Software techniques such as tensor parallelism contribute to these speed improvements by splitting model weights and computations across multiple GPUs. NVIDIA has also introduced a new technique called Helix Parallelism for its Blackwell GPUs, designed to further reduce latency during the attention and feed-forward network phases of inference.
- For aerospace applications, GPGPUs are powerful for high-throughput tasks, but Field-Programmable Gate Arrays (FPGAs) are often preferred for edge inference due to their lower latency, power efficiency, and deterministic execution, all critical advantages in real-time, resource-constrained environments.
- Integrating any AI/ML model into airborne systems requires navigating the DO-178C certification standard, a significant challenge for non-deterministic or complex systems. The standard demands rigorous validation and verification, and current strategies for certifiable AI often involve limiting its use to lower-criticality functions (DAL-D) or using traditional deterministic software to monitor and control the AI component.
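
The parameter counts and the single-GPU claim above can be sanity-checked with some rough arithmetic. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes that every parameter is stored in MXFP4 and that MXFP4 costs 4.25 bits per parameter (4-bit elements plus one shared 8-bit scale per block of 32 elements). In practice some layers may be kept at higher precision, and activations and the KV cache need memory too, so real usage will differ.

```python
# Back-of-the-envelope weight-memory estimate for gpt-oss-120b.
TOTAL_PARAMS = 117e9    # total parameters (from the text)
ACTIVE_PARAMS = 5.1e9   # parameters activated per token (from the text)
H100_MEMORY_GB = 80     # memory of a single H100 (from the text)

# ASSUMPTION: 4-bit elements plus one shared 8-bit scale per 32-element block.
MXFP4_BITS_PER_PARAM = 4 + 8 / 32   # 4.25 bits
BF16_BITS_PER_PARAM = 16            # 16-bit baseline for comparison


def weight_memory_gb(num_params, bits_per_param):
    """Return the storage needed for the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# ASSUMPTION: all weights are quantized; real checkpoints may mix precisions.
mxfp4_gb = weight_memory_gb(TOTAL_PARAMS, MXFP4_BITS_PER_PARAM)
bf16_gb = weight_memory_gb(TOTAL_PARAMS, BF16_BITS_PER_PARAM)
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"MXFP4 weights: ~{mxfp4_gb:.1f} GB")
print(f"BF16 weights:  ~{bf16_gb:.1f} GB")
print(f"Active per token: {active_fraction:.1%} of parameters")
print(f"Fits on one H100 in MXFP4: {mxfp4_gb < H100_MEMORY_GB}")
```

Under these assumptions the 4-bit weights come to roughly 62 GB, which is under the 80GB limit, while the same weights at 16 bits would need about 234 GB.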
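
The idea behind tensor parallelism, splitting a layer's weights and computation across devices, can be illustrated with a toy example. The sketch below is pure Python with no GPUs involved; it shows only the column-parallel variant, and all function names are hypothetical. It is not NVIDIA's implementation and does not model Helix Parallelism. Each "shard" stands in for the slice of a weight matrix that one GPU would hold.

```python
def matvec(x, w):
    """Compute x @ w for a vector x (length d_in) and a matrix w (d_in rows, d_out columns)."""
    d_out = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(d_out)]


def split_columns(w, num_shards):
    """Split w column-wise into num_shards equal slices, one per (simulated) device."""
    d_out = len(w[0])
    assert d_out % num_shards == 0, "columns must divide evenly across shards"
    width = d_out // num_shards
    return [[row[s * width:(s + 1) * width] for row in w] for s in range(num_shards)]


def column_parallel_matvec(x, w, num_shards):
    """Each shard computes its slice of the output; the slices are then concatenated."""
    shards = split_columns(w, num_shards)
    # On real hardware each of these products would run on a different GPU.
    partial_outputs = [matvec(x, shard) for shard in shards]
    # Concatenation plays the role of an all-gather across devices.
    return [value for part in partial_outputs for value in part]


x = [1.0, 2.0, 3.0]
w = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 3.0],
    [2.0, 1.0, 0.0, 1.0],
]

full = matvec(x, w)                        # single-device reference
sharded = column_parallel_matvec(x, w, 2)  # two simulated devices
print(full, sharded, full == sharded)
```

Each simulated device holds only half of the weight columns and does half of the multiply-adds, yet the concatenated result matches the single-device output. On real systems the communication step is what limits how far this scales.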