NVIDIA Accelerates Inference for OpenAI and Meta Models
NVIDIA announced a nearly 2x improvement in output speed for OpenAI’s gpt-oss-120b model, achieved through software co-optimization. The company also highlighted its role in powering Meta's new Llama 4 Scout model. Both efforts underscore the continued importance of libraries such as TensorRT-LLM for maximizing performance on NVIDIA hardware.
- OpenAI's gpt-oss-120b is a Mixture-of-Experts (MoE) transformer with 116.8 billion parameters and a 128k-token context length. It was released with native MXFP4 quantization, a 4-bit floating-point format, which significantly reduces its memory footprint.
- Meta's Llama 4 Scout is a multimodal MoE model with 109 billion total parameters (17 billion active per token) and 16 experts. It is designed for efficiency and can run on a single NVIDIA H100 GPU using INT4 quantization, supporting a context window of up to 10 million tokens.
- TensorRT-LLM achieves performance gains through techniques like in-flight batching, paged key-value (KV) caching, and custom attention kernels. A key feature is chunked prefill, which breaks the initial prompt processing into smaller pieces to improve GPU utilization and better handle long contexts.
- The optimizations for the gpt-oss model are tightly coupled with NVIDIA's latest hardware, such as the Blackwell B200/GB200 GPUs, which feature native FP4 Tensor Cores essential for efficiently processing models with 4-bit precision.
- For inference serving, TensorRT-LLM focuses on ahead-of-time kernel fusion and graph compilation to achieve the lowest possible latency on stable workloads. This contrasts with frameworks like vLLM, which often provide greater flexibility and faster integration with a wide range of Hugging Face models by using more dynamic runtime optimizations.
- The push for greater inference efficiency directly addresses the primary enterprise challenges in deploying LLMs: managing high computational costs, ensuring low latency for real-time applications, and scaling infrastructure to handle large workloads.
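To make the memory-footprint point concrete, the following back-of-the-envelope sketch estimates weight storage from the parameter counts quoted above. It is a rough illustration only: it assumes every parameter is stored at the same bit width, ignores quantization scale factors, the KV cache, and activations, and assumes the 80 GB variant of the H100. The function name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


H100_MEMORY_GB = 80  # assumption: the 80 GB H100 variant

# Parameter counts are the ones quoted in the article.
MODELS = {
    "Llama 4 Scout": 109e9,
    "gpt-oss-120b": 116.8e9,
}

if __name__ == "__main__":
    for name, params in MODELS.items():
        for bits in (16, 4):
            gb = weight_memory_gb(params, bits)
            fits = "fits" if gb < H100_MEMORY_GB else "does not fit"
            print(f"{name} at {bits}-bit: ~{gb:.1f} GB of weights, {fits} in {H100_MEMORY_GB} GB")
```

Under these simplifications, 4-bit weights come to roughly a quarter of their 16-bit size, which is what brings a model of this scale within reach of a single 80 GB GPU. Real deployments also need headroom for the KV cache, which grows with context length.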
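The idea behind chunked prefill can be sketched with a toy scheduler. This is not TensorRT-LLM's API or its actual scheduling policy; the names `Step` and `schedule_chunked_prefill` and the fixed per-iteration token budget are assumptions made for illustration. Each iteration reserves one token for every request that is already decoding and spends the rest of the budget on the next slice of a newly arrived prompt, so a long prompt does not stall in-flight requests.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    decode_tokens: int   # one token per request that is already generating
    prefill_start: int   # start of the prompt slice processed this iteration
    prefill_end: int     # end of the slice (exclusive)


def schedule_chunked_prefill(prompt_len: int, num_decoding: int, token_budget: int) -> List[Step]:
    """Split one prompt's prefill into chunks that share each iteration with decode work."""
    if token_budget <= num_decoding:
        raise ValueError("token budget must leave room for at least one prefill token")
    chunk_capacity = token_budget - num_decoding
    steps = []
    start = 0
    while start < prompt_len:
        end = min(start + chunk_capacity, prompt_len)
        steps.append(Step(num_decoding, start, end))
        start = end
    return steps


if __name__ == "__main__":
    # Hypothetical workload: a 10,000-token prompt arrives while 8 requests are decoding.
    for i, step in enumerate(schedule_chunked_prefill(10_000, 8, 2_048)):
        print(f"iteration {i}: decode {step.decode_tokens} tokens, "
              f"prefill prompt[{step.prefill_start}:{step.prefill_end}]")
```

With these example numbers the prompt is processed over five iterations, and no iteration exceeds the 2,048-token budget. A production scheduler would also handle multiple waiting prompts, requests finishing mid-stream, and KV-cache block allocation.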