NVIDIA and OpenAI Double Inference Throughput
NVIDIA and OpenAI announced a collaboration that has nearly doubled inference throughput for OpenAI's GPT-OSS-120B model. The gain came from joint hardware-software co-design, which highlights the benefit of tightly integrating the model with the underlying GPU stack.
- The collaboration builds on a long-standing relationship dating back to 2016, when NVIDIA CEO Jensen Huang personally delivered the first DGX-1 supercomputer to OpenAI.
- The GPT-OSS-120B model uses a Mixture-of-Experts (MoE) architecture with 117 billion total parameters, of which only 5.1 billion are active per token, and it was trained on NVIDIA H100 Tensor Core GPUs. The model is designed for agentic tasks and supports chain-of-thought reasoning and tool use.
- The performance gains were achieved on NVIDIA's Blackwell architecture, where a single GB200 NVL72 system can deliver up to 1.5 million tokens per second. Specific software optimizations include specialized attention and MoE routing kernels available through TensorRT-LLM and CUTLASS.
- The model is released in FP4 precision, a 4-bit floating-point quantization format, which lets the entire 117B-parameter model fit on a single 80 GB data center GPU such as an H100.
- Open-source serving frameworks such as vLLM are being used together with NVIDIA's tools to further optimize performance. One company reports over 650 tokens per second by splitting the model across 4 or 8 GPUs with tensor parallelism.
- Other inference optimization techniques being applied to models like GPT-OSS-120B include speculative decoding, in which a smaller "draft" model predicts tokens that the larger model then validates.
- This hardware-software co-design approach is part of a broader industry trend aimed at mitigating the computational and memory demands of large-scale models, which can require on the order of a zetta (10^21) floating-point operations.
- The partnership between NVIDIA and OpenAI has expanded into a major AI infrastructure deal, with plans to deploy multi-gigawatt data centers powered by millions of NVIDIA GPUs.
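
The claim that the 4-bit model fits on one 80 GB GPU can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the function name `weight_memory_gb` is hypothetical, and the estimate assumes every parameter is stored at the stated bit width, ignoring quantization scale factors, any layers kept at higher precision, the KV cache, and activations.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 117e9   # total parameters, from the article
ACTIVE_PARAMS = 5.1e9  # parameters active per token, from the article
GPU_MEMORY_GB = 80     # single 80 GB data center GPU, from the article

fp4_gb = weight_memory_gb(TOTAL_PARAMS, 4)    # 4-bit weights
bf16_gb = weight_memory_gb(TOTAL_PARAMS, 16)  # 16-bit weights, for comparison

# Fraction of the model that does work for any single token (MoE sparsity).
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"4-bit weights:  {fp4_gb:.1f} GB (fits in 80 GB: {fp4_gb < GPU_MEMORY_GB})")
print(f"16-bit weights: {bf16_gb:.1f} GB (fits in 80 GB: {bf16_gb < GPU_MEMORY_GB})")
print(f"Active parameters per token: {active_fraction:.1%}")
```

Under these simplifying assumptions the 4-bit weights take roughly 58.5 GB, leaving some headroom on an 80 GB device, while the same weights at 16 bits would need about 234 GB and would have to be split across several GPUs.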
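
The draft-then-validate idea behind speculative decoding can be illustrated with a toy example. This is a minimal sketch of the greedy variant, not the implementation used by any of the frameworks named above: the "models" are hypothetical stand-in functions that map a token list to the next token, and all names (`speculative_step`, `generate`, `toy_target`, `toy_draft`) are invented for illustration.

```python
from typing import Callable, List, Sequence

# A "model" here is any function mapping a token context to the next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one round of greedy speculative decoding; return tokens to append."""
    # Draft phase: the small model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))

    # Verify phase: the large model checks each proposed token in order.
    # (A real system would score all k positions in one batched forward pass.)
    accepted: List[int] = []
    for tok in proposal:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the large model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft token was accepted: the large model adds one more token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Generate n_new tokens after the prompt using speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + n_new]


# Toy stand-ins: the target counts upward modulo 10; the draft agrees
# except right after token 4, where it guesses wrong.
def toy_target(ctx: Sequence[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: Sequence[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(generate(toy_target, toy_draft, [0], 12))
```

In this greedy form, every emitted token is one the large model would have chosen itself, so the output matches ordinary decoding with the large model alone. Any speedup comes from accepting several draft tokens per large-model verification pass when the draft guesses well.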