Llama 70B FP8 Model Shows Performance Bottlenecks

A recent developer report highlighted real-world performance challenges with new model formats, even on current-generation hardware. A user running the Llama 3.3 70B Instruct FP8 model with TensorRT-LLM on a DGX Spark/GB10 setup reported a throughput of just 3 tokens per second. The report suggests that performance with quantized models remains sensitive to hardware architecture, prompt length, and hardware-specific tuning.
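One way to put a single-stream figure like this in context is a back-of-envelope memory-bandwidth ceiling: during decoding, each generated token needs roughly one full pass over the weights, so bandwidth divided by weight size bounds tokens per second. The sketch below is an illustration under stated assumptions, not a measurement. The bandwidth values are assumed round figures that should be checked against vendor spec sheets, and the function names are hypothetical.

```python
def weight_bytes(n_params: float, bits_per_weight: int) -> float:
    """Bytes needed for the weights alone (no KV cache, no activations)."""
    return n_params * bits_per_weight / 8


def decode_ceiling_tok_s(n_params: float, bits_per_weight: int,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed, assuming every generated
    token requires streaming all weights from memory exactly once."""
    return bandwidth_gb_s * 1e9 / weight_bytes(n_params, bits_per_weight)


N_PARAMS = 70e9  # parameter count of a 70B model

# ASSUMED bandwidth figures, for illustration only.
ASSUMED_BANDWIDTH_GB_S = {
    "GB10-class unified memory (assumed ~273 GB/s)": 273,
    "H100 SXM-class HBM3 (assumed ~3350 GB/s)": 3350,
}

for name, bandwidth in ASSUMED_BANDWIDTH_GB_S.items():
    for bits in (16, 8):
        ceiling = decode_ceiling_tok_s(N_PARAMS, bits, bandwidth)
        print(f"{name}, {bits}-bit weights: at most {ceiling:.1f} tok/s")
```

If the assumed unified-memory figure is roughly right, the FP8 ceiling for one request at a time lands in the low single digits of tokens per second, so a reading of 3 tokens/second may partly reflect memory bandwidth rather than tuning alone. Batching, multi-GPU sharding, and speculative decoding all change this bound, which is one reason published benchmarks vary so widely.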

- The DGX GB200 system sits at the rack-scale end of the Grace Blackwell family, well above the DGX Spark/GB10 setup named in the report. A single liquid-cooled rack contains 36 NVIDIA Grace CPUs and 72 Blackwell GPUs, interconnected by NVLink switches. The architecture is designed for training and inference on trillion-parameter models, and NVIDIA claims up to 30 times faster inference than comparable H100 systems along with lower energy consumption.
- FP8 quantization reduces a model's memory footprint and can accelerate inference by using an 8-bit floating-point format instead of the typical 16-bit formats (FP16/BF16). While this can lead to 2-4x speedups, FP8's smaller dynamic range requires careful calibration to maintain model accuracy, especially for models with outlier values in their activations.
- TensorRT-LLM is an open-source NVIDIA library for optimizing LLM inference on NVIDIA GPUs. It uses techniques such as kernel fusion, paged attention, and quantization to build a highly optimized "engine" for a specific GPU architecture. That optimization can be complex, requiring model compilation and parameter tuning to reach maximum performance.
- While TensorRT-LLM is built for peak performance on NVIDIA hardware, frameworks like vLLM are often considered easier to use and integrate, especially with Hugging Face models. Benchmarks comparing the two show varied results: TensorRT-LLM can achieve higher throughput at large batch sizes, whereas vLLM may be faster at smaller batch sizes or show lower latency in certain configurations.
- Performance of quantized models is highly sensitive to the quantization scheme. For FP8, there are "static" and "dynamic" approaches. Dynamic quantization, where scaling factors are calculated at runtime, can sometimes preserve accuracy better but may introduce overhead compared to static methods, where scales are determined ahead of time. The optimal choice often depends on the workload and batch size.
- The reported 3 tokens/second is significantly below typical published figures for a 70B model. For comparison, benchmarks for the Llama 3.3 70B model on various platforms show speeds ranging from dozens to hundreds of tokens per second, with some specialized hardware claiming over 2,000 tokens/second. One provider's benchmarks for the Llama 3.3 70B FP8 model show over 50 tokens/second at a concurrency of one, scaling to handle many more requests per minute.
- Issues can also arise from suboptimal fusion of operations in the underlying inference kernels. An open issue in the TensorRT-LLM GitHub repository notes that FP8 quantization operations are not always fused with the main matrix multiplication (GEMM) kernels. Without fusion, separate scaling operations must run before each matrix multiplication, adding overhead that keeps the hardware from reaching its theoretical peak throughput.
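The static versus dynamic trade-off described above can be sketched in a few lines of plain Python. This is a simplified illustration, not TensorRT-LLM's implementation: `round_to_e4m3` is a hypothetical helper that crudely imitates the E4M3 format (largest finite value 448, four significant bits) and ignores subnormals, and the sample values are made up.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


def round_to_e4m3(v: float) -> float:
    """Crude E4M3 imitation: clamp to +/-448 and keep 4 significant bits.
    Subnormals and the minimum exponent are ignored for simplicity."""
    if v == 0.0:
        return 0.0
    v = max(-E4M3_MAX, min(E4M3_MAX, v))
    mantissa, exponent = math.frexp(v)  # v == mantissa * 2**exponent
    return math.ldexp(round(mantissa * 16) / 16, exponent)


def amax_scale(values):
    """Per-tensor scale that maps the largest magnitude onto E4M3_MAX."""
    return max(abs(v) for v in values) / E4M3_MAX


def quantize_dequantize(values, scale):
    """Scale into FP8 range, round, then scale back to inspect the error."""
    return [round_to_e4m3(v / scale) * scale for v in values]


calibration = [0.5, -1.0, 2.0, -4.0]    # made-up calibration activations
runtime_batch = [0.5, -1.0, 2.0, 40.0]  # made-up batch with an outlier

static_scale = amax_scale(calibration)     # fixed ahead of time
dynamic_scale = amax_scale(runtime_batch)  # recomputed per batch at runtime

static_out = quantize_dequantize(runtime_batch, static_scale)
dynamic_out = quantize_dequantize(runtime_batch, dynamic_scale)

print("static :", static_out)   # the outlier saturates near 4.0
print("dynamic:", dynamic_out)  # the outlier survives near 40.0
```

With the static scale, the outlier saturates at the calibrated maximum, while the dynamic scale preserves it at the cost of coarser resolution for the small values and an extra max-reduction pass over each batch. That extra pass is the kind of runtime overhead the bullet on dynamic quantization refers to.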

