Quantized Llama 70B models show poor throughput
Engineers are reporting significant performance bottlenecks with quantized large language models in real-world serving environments. Llama 3.3 70B Instruct models running at FP8 precision are reportedly stuck at just 3 tokens per second, even on NVIDIA DGX Spark and other GB10-based systems. The reports indicate that prompt length, context window size, and model architecture remain the primary bottlenecks despite advances in quantization.
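A rough bandwidth-bound estimate helps put the reported figure in context. During single-stream token generation, each new token requires reading roughly all model weights from memory once, so throughput is capped at about memory bandwidth divided by weight size. The sketch below is an approximation, not a benchmark: the 273 GB/s bandwidth figure is an assumed spec for the DGX Spark rather than a number from the reports, the function name is illustrative, and the estimate ignores KV-cache reads, activations, and batching.

```python
def decode_tokens_per_second(params_billion, bytes_per_param, bandwidth_gb_s):
    """Upper bound on single-stream decode speed when weight reads dominate.

    Assumes every generated token streams all weights from memory once and
    that nothing else (KV cache, activations) competes for bandwidth.
    """
    weight_gb = params_billion * bytes_per_param  # GB read per generated token
    return bandwidth_gb_s / weight_gb


# Assumed inputs (not taken from the reports):
#   70B parameters, 1 byte per parameter for FP8, 273 GB/s LPDDR5X bandwidth.
est = decode_tokens_per_second(params_billion=70, bytes_per_param=1.0,
                               bandwidth_gb_s=273.0)
print(f"{est:.1f} tokens/s")
```

Under these assumptions the ceiling is just under 4 tokens per second, so a measured rate near 3 would be consistent with a memory-bandwidth limit rather than a misconfigured server.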
- The NVIDIA DGX Spark and other GB10-based systems are built on a unified memory architecture, sharing 128GB of LPDDR5X memory between the CPU and GPU. This design is excellent for fitting large models that would exceed the VRAM of discrete GPUs, but LPDDR5X has lower bandwidth than the HBM or GDDR memory found in discrete data-center GPUs, creating a bottleneck during the memory-intensive token generation (decode) phase.
- FP8 quantization is not a "free" performance gain; quantizing and dequantizing weights and activations on the fly consumes compute cycles. This overhead can account for up to 30% of the total execution time for the matrix multiplication (GEMM) operations at the core of the transformer.
- The choice between inference servers like vLLM and TensorRT-LLM involves a trade-off. TensorRT-LLM is optimized by NVIDIA for its hardware and can achieve higher throughput with FP8 on stable workloads, while vLLM provides greater