FP8 quantization yields significant throughput gains
Recent technical reports indicate that training and inference in lower-precision data types such as FP8, INT8, and INT4 are becoming standard practice for optimizing performance on modern GPUs. One report states that training Llama-2 7B in FP8 increased throughput by 34%. For models such as InternLM2, quantization reportedly yields 30-40% throughput improvements on Hopper and Blackwell GPUs without a loss in accuracy, which translates into significant cost savings.
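To make the link between throughput and cost concrete, the following is a minimal sketch of the arithmetic, not a calculation taken from the reports. It assumes a fixed workload and a cost that scales linearly with GPU time; the function name `cost_reduction_from_speedup` is hypothetical. Under those assumptions, a throughput gain of g reduces GPU time by 1 - 1/(1 + g).

```python
def cost_reduction_from_speedup(throughput_gain: float) -> float:
    """Fraction of GPU time saved on a fixed workload.

    throughput_gain is fractional (0.34 means a 34% increase).
    Assumption: cost is proportional to GPU time, so the same
    fraction applies to cost.
    """
    if throughput_gain <= -1.0:
        raise ValueError("throughput_gain must be greater than -1.0")
    # Time for a fixed workload scales as 1 / throughput.
    return 1.0 - 1.0 / (1.0 + throughput_gain)


if __name__ == "__main__":
    # Gains quoted in the text: 30-40% for quantization, 34% for FP8 training.
    for gain in (0.30, 0.34, 0.40):
        saved = cost_reduction_from_speedup(gain)
        print(f"{gain:.0%} throughput gain -> {saved:.1%} less GPU time")
    # 30% throughput gain -> 23.1% less GPU time
    # 34% throughput gain -> 25.4% less GPU time
    # 40% throughput gain -> 28.6% less GPU time
```

Note that the saving is always smaller than the headline throughput figure: a 34% throughput increase corresponds to roughly a quarter less GPU time, not a third.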