Hardware advances for LLM inference
NVIDIA's Blackwell Ultra architecture improves LLM inference efficiency by accelerating core operations such as softmax, which lowers latency. In parallel, AMD has demonstrated trillion-parameter models running on small clusters of consumer-grade AI PCs using distributed inference, part of a trend toward more accessible large-model deployment.
- NVIDIA's Blackwell Ultra is a dual-die GPU with 208 billion transistors. The two dies are connected by a 10 TB/s high-bandwidth interface, so the package functions as a single, unified CUDA-programmed accelerator. It introduces support for a new 4-bit floating-point (FP4) precision format, which delivers 15 petaFLOPS of dense compute and reduces the memory footprint by approximately 1.8 times compared with FP8.
- A single rack of the GB300 NVL72 system, which integrates 72 Blackwell Ultra GPUs and 36 Grace CPUs, can achieve 1.1 exaFLOPS of FP4 compute (72 GPUs at 15 petaFLOPS each is about 1.08 exaFLOPS), operating as an exascale supercomputer in a single rack. The system requires significant infrastructure changes, including liquid cooling and 800-gigabit networking, to handle its increased power and thermal density.
- Compared with the previous Hopper generation, the DGX B200 system with eight Blackwell GPUs provides up to 15 times the inference performance. For large language models, NVIDIA puts the potential reduction in both cost and energy consumption at up to 25-fold.
- AMD's Instinct MI355X GPU, with 288 GB of HBM3e memory, is positioned as a strong competitor for inference workloads. In some benchmarks with reasoning-focused models such as DeepSeek-R1, the MI355X has demonstrated 12% higher throughput per GPU at high concurrency than NVIDIA's B200.
- The trend of running large models on consumer hardware is enabled by software such as llama.cpp, which supports model quantization and can split model layers between GPU VRAM and system RAM. This "CPU offloading" allows models that are too large for a single GPU's VRAM to run, albeit at reduced speed, because the layers held in system RAM execute on the CPU.
- Distributed inference techniques are crucial for running trillion-parameter models and involve several forms of parallelism. Tensor parallelism splits the weight matrices of individual layers across multiple GPUs. Pipeline parallelism places consecutive groups of layers on different GPUs and feeds micro-batches through those stages, so that the GPUs work on different micro-batches simultaneously and utilization improves.
- The market for AI inference chips is projected to surpass $100 billion by 2027, with inference spending expected to exceed that of training. While NVIDIA holds a dominant market share of over 90% in AI GPUs, a growing number of competitors, including startups like Groq and Cerebras, are developing specialized hardware for inference.
- The cost of LLM inference has dropped dramatically, with some reports indicating a 280-fold reduction for certain performance levels between late 2022 and late 2024. This is driven by a combination of hardware innovation, algorithmic optimizations, and new software frameworks.
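
The memory arithmetic behind two of the points above (the FP4 footprint reduction and the split of model layers between GPU VRAM and system RAM) can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not how any particular runtime sizes its buffers: the 70B-parameter model, the 80-layer count, the 24 GB GPU, and the helper names `weight_bytes` and `plan_offload` are all hypothetical, and the FP4 format is assumed to store one 8-bit scale per block of 16 weights.

```python
from dataclasses import dataclass


def weight_bytes(n_params: int, value_bits: int,
                 block_size: int = 0, scale_bits: int = 0) -> int:
    """Bytes needed to store n_params weights.

    Assumption: a block-scaled format adds one scale of `scale_bits` bits
    per `block_size` weights; block_size=0 means no per-block scales.
    """
    total_bits = n_params * value_bits
    if block_size > 0:
        n_blocks = -(-n_params // block_size)  # ceiling division
        total_bits += n_blocks * scale_bits
    return -(-total_bits // 8)  # round up to whole bytes


@dataclass
class OffloadPlan:
    gpu_layers: int
    cpu_layers: int
    gpu_bytes: int
    cpu_bytes: int


def plan_offload(model_bytes: int, n_layers: int,
                 vram_bytes: int, reserve_bytes: int = 0) -> OffloadPlan:
    """Greedy split: fill VRAM with whole layers, keep the rest in system RAM.

    Assumptions: every layer is the same size (embeddings and the output
    head are ignored), and `reserve_bytes` of VRAM is held back for the
    KV cache and activations.
    """
    per_layer = -(-model_bytes // n_layers)  # bytes per layer, rounded up
    usable = max(vram_bytes - reserve_bytes, 0)
    gpu_layers = min(n_layers, usable // per_layer)
    gpu_bytes = min(gpu_layers * per_layer, model_bytes)
    return OffloadPlan(gpu_layers, n_layers - gpu_layers,
                       gpu_bytes, model_bytes - gpu_bytes)


if __name__ == "__main__":
    n_params = 70_000_000_000  # hypothetical 70B-parameter model
    fp8 = weight_bytes(n_params, 8)
    fp4 = weight_bytes(n_params, 4, block_size=16, scale_bits=8)
    print(f"FP8 {fp8 / 1e9:.1f} GB, FP4 {fp4 / 1e9:.1f} GB, "
          f"ratio {fp8 / fp4:.2f}x")
    # Hypothetical 24 GB consumer GPU with 2 GB reserved for the KV cache.
    print(plan_offload(fp4, n_layers=80,
                       vram_bytes=24 * 10**9, reserve_bytes=2 * 10**9))
```

Under these assumptions the per-block scales cost an extra half bit per weight, so FP4 comes out about 1.78 times smaller than FP8 rather than exactly 2 times, which is in line with the roughly 1.8 times figure quoted above. The same model then fits 44 of its 80 layers on the 24 GB GPU, with the remaining 36 held in system RAM. In llama.cpp, the number of layers placed on the GPU is chosen with the `--n-gpu-layers` (`-ngl`) option.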