H100 benchmarks favour small models
- Recent H100 benchmarks showed a 2‑billion‑parameter model achieving about 14× the throughput of a 31‑billion‑parameter model on the same hardware in specific tests.
- NVIDIA FP8 quantization reportedly boosted throughput by roughly 73%, underscoring how much quantization can improve inference efficiency.
- The results suggest that large dense models aren't always the best choice for serving; small language models (SLMs) combined with FP8 can be far more cost‑effective for high‑throughput workloads. (x.com/Gaurav_vij137/status/2047967372565492092)
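To see how the reported throughput ratios could translate into serving cost, here is a minimal sketch. It assumes a fixed price per GPU-hour and a fully utilized GPU, so cost per token scales as the inverse of throughput. The function name `relative_cost_per_token` is hypothetical, and the only inputs are the two ratios reported above (about 14× and roughly 73%). The source does not say which model the FP8 result applies to, so the two figures are kept separate rather than multiplied together.

```python
def relative_cost_per_token(throughput_multiplier: float) -> float:
    """Return cost per token relative to a baseline of 1.0.

    Assumption (not from the source): the GPU-hour price is fixed and the
    GPU is kept busy, so cost per token is proportional to 1 / throughput.
    """
    if throughput_multiplier <= 0:
        raise ValueError("throughput_multiplier must be positive")
    return 1.0 / throughput_multiplier


# Reported ratio: the 2B model at about 14x the throughput of the 31B model.
small_vs_large = relative_cost_per_token(14.0)

# Reported ratio: FP8 at roughly +73% throughput, i.e. a 1.73x multiplier.
fp8_vs_baseline = relative_cost_per_token(1.73)

print(f"2B vs 31B cost per token: {small_vs_large:.1%} of baseline")
print(f"FP8 vs unquantized cost per token: {fp8_vs_baseline:.1%} of baseline")
```

Under these assumptions, the small model would cost about 7% as much per token as the large one, and FP8 alone would cut per-token cost to about 58% of the unquantized figure. Real deployments also depend on batch size, latency targets, and output quality, none of which this sketch models.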