Llama 70B FP8 Inference Faces Real-World Bottlenecks

Practitioners are reporting poor performance when deploying Llama 70B models in FP8 precision on high-end hardware. Despite FP8's theoretical throughput gains, some deployments are stalling at just 3 tokens per second. The gap is a reminder that real-world performance still depends on careful tuning of the model and hardware configuration, even with advanced numerical formats.

- FP8 support is not universal across GPUs; it requires Tensor Cores with native FP8 instructions, which arrived with the 4th generation in NVIDIA's Hopper architecture (e.g., H100) and carry forward into Blackwell (e.g., B200, GB10). Older architectures like Ampere lack native FP8 Tensor Core support, so they cannot run FP8 computations on both weights and activations.
- The performance gains of FP8 are unlocked by specific software libraries like NVIDIA's TensorRT-LLM and vLLM, which have incorporated support for FP8 quantization. These frameworks are needed to map the model's operations onto the hardware's native FP8 instructions.
- The bottleneck in token generation for large models is frequently memory bandwidth, not raw compute (FLOPs). Decoding is often "memory-bound" because each new token requires loading the model's weights from GPU memory (HBM on data-center parts) into on-chip SRAM, and the speed of this data transfer is the limiting factor.
- While the weights of a 70B-parameter model in FP8 require about 70 GB of VRAM, the Key-Value (KV) cache is a major and often overlooked consumer of memory. The KV cache grows linearly with both sequence length and batch size, and it can quickly exhaust the remaining VRAM and force slower memory swaps, which can cause performance to plummet.
- There are two main approaches to FP8 quantization: static and dynamic. Static quantization, where scaling factors are pre-calculated with a calibration dataset, generally offers better performance. Dynamic quantization calculates activation scaling factors on the fly at inference time, which can preserve accuracy but introduces computational overhead that may reduce overall throughput.
- Low token-per-second rates are often observed with a small batch size or a single concurrent request, which fails to saturate the GPU's computational resources. Inference engines like vLLM use techniques like in-flight batching (also called continuous batching) to process multiple requests concurrently, which amortizes the cost of memory access and significantly improves overall throughput.
- The specific hardware in the forum post, an NVIDIA GB10, uses a unified memory architecture. Unlike the dedicated HBM3 memory on H100 GPUs, system RAM and GPU memory are shared, which can lead to contention and different performance characteristics compared to systems with dedicated GPU memory.
- For reference, NVIDIA's own benchmarks for a Llama 3.1 70B model using FP8 on eight H100 GPUs with TensorRT-LLM showed a 1.5x throughput speedup compared to its BF16 version. This suggests that the gains are typically realized in multi-GPU, high-throughput scenarios.
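The memory-bound and KV-cache points above lend themselves to a quick back-of-envelope check. The sketch below is illustrative only: the function names are hypothetical, the 273 GB/s bandwidth is an assumed figure for a GB10-class unified-memory system that should be checked against the vendor's spec sheet, and the model shape (80 layers, 8 KV heads, head dimension 128) is an assumption based on the published Llama 70B configuration with grouped-query attention.

```python
def decode_tokens_per_second_ceiling(weight_bytes: float,
                                     bandwidth_bytes_per_s: float) -> float:
    """Rough upper bound on single-request decode speed.

    Assumes every generated token streams all model weights from memory
    exactly once and that nothing else limits throughput.
    """
    return bandwidth_bytes_per_s / weight_bytes


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_value: int) -> int:
    """Size of the KV cache: one K and one V tensor per layer."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


if __name__ == "__main__":
    # Assumed inputs, not measurements.
    weights = 70e9        # ~70 GB of FP8 weights (1 byte per parameter)
    bandwidth = 273e9     # assumed unified-memory bandwidth in bytes/s

    ceiling = decode_tokens_per_second_ceiling(weights, bandwidth)
    print(f"Bandwidth-bound ceiling: {ceiling:.1f} tokens/s")

    # Assumed Llama 70B shape; 2 bytes per value for an FP16 KV cache.
    cache = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                           seq_len=8192, batch_size=1, bytes_per_value=2)
    print(f"KV cache at 8192 tokens, batch 1: {cache / 2**30:.2f} GiB")
```

Under these assumptions the bandwidth ceiling for a single request is about 3.9 tokens per second, which is in the same range as the 3 tokens per second reported, and one 8,192-token sequence adds about 2.5 GiB of KV cache on top of the weights. The ceiling applies per pass over the weights, so batching several requests into each pass raises aggregate throughput without raising the per-request rate.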
