NVIDIA and OpenAI Double GPT Model Speed
NVIDIA and OpenAI announced a collaborative effort that nearly doubled the output speed of OpenAI’s gpt-oss-120b model. The gain is attributed to hardware-software co-design and to optimizations in NVIDIA's TensorRT-LLM inference library.
- The gpt-oss-120b model uses a Mixture-of-Experts (MoE) architecture, which keeps the total parameter count high at 117B while activating only 5.1B parameters per token. This sparsity, combined with native MXFP4 quantization for the MoE layers, is a key architectural choice that allows the model to run on a single 80GB GPU.
- The performance gains were achieved by leveraging specific features in TensorRT-LLM for NVIDIA's latest hardware, including optimized CUTLASS MoE kernels and specialized attention kernels for the Blackwell architecture. On the Hopper architecture, the optimizations use XQA kernels for attention.
- This hardware-software co-design approach allows the model's low-precision FP4 format to be handled natively by NVIDIA's Blackwell GPUs, avoiding the dequantization overhead that can occur on hardware without native FP4 support.
- On a single NVIDIA GB200 NVL72 system, these optimizations enable the gpt-oss-120b model to serve up to 1.5 million tokens per second, which can support approximately 50,000 concurrent users.
- The collaboration highlights a key difference between inference libraries: while frameworks like vLLM offer broad compatibility, TensorRT-LLM is designed to extract maximum performance by using CUDA graph optimizations and fused kernels tightly coupled to specific NVIDIA hardware features.
- This release is part of a broader strategic shift for OpenAI, which had not released an open-weight model since GPT-2 in 2019.
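
The figures quoted above can be sanity-checked with some back-of-envelope arithmetic. The sketch below is illustrative only. It assumes MXFP4 costs about 4.25 bits per parameter (4-bit elements plus one 8-bit scale shared by each block of 32 values, as in the OCP Microscaling format), and it simplifies by treating every weight as MXFP4, even though only the MoE layers are quantized that way, so the real footprint will be somewhat larger. The function names are hypothetical helpers, not part of any NVIDIA or OpenAI API.

```python
# Back-of-envelope checks on the numbers quoted in the summary above.
# All constants come from the text except MXFP4_BITS_PER_PARAM, which is
# an assumption based on the OCP Microscaling (MX) format layout.

TOTAL_PARAMS = 117e9          # total parameters (from the text)
ACTIVE_PARAMS = 5.1e9         # parameters activated per token (from the text)
SYSTEM_TOKENS_PER_S = 1.5e6   # GB200 NVL72 aggregate throughput (from the text)
CONCURRENT_USERS = 50_000     # approximate concurrent users (from the text)
GPU_MEMORY_GB = 80            # single-GPU memory budget (from the text)

# Assumption: 4-bit element + one 8-bit shared scale per 32-element block.
MXFP4_BITS_PER_PARAM = 4 + 8 / 32   # 4.25 bits


def active_fraction(active: float, total: float) -> float:
    """Share of the model's weights touched for each generated token."""
    return active / total


def per_user_tokens_per_s(system_tps: float, users: int) -> float:
    """Average decode rate per user if throughput is split evenly."""
    return system_tps / users


def weight_footprint_gb(params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    frac = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    rate = per_user_tokens_per_s(SYSTEM_TOKENS_PER_S, CONCURRENT_USERS)
    mxfp4_gb = weight_footprint_gb(TOTAL_PARAMS, MXFP4_BITS_PER_PARAM)
    bf16_gb = weight_footprint_gb(TOTAL_PARAMS, 16)

    print(f"Active share per token: {frac:.1%}")
    print(f"Per-user rate: {rate:.0f} tokens/s")
    print(f"Weights at ~4.25 bits: {mxfp4_gb:.1f} GB (budget {GPU_MEMORY_GB} GB)")
    print(f"Weights at 16 bits:    {bf16_gb:.1f} GB")
```

Under these assumptions, each user sees roughly 30 tokens per second, only about 4% of the weights are active for any given token, and the 4-bit weights come to roughly 62 GB, which fits within 80 GB where a 16-bit copy (about 234 GB) would not.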