Open-Source LLMs See Up to 10x Cost Reduction
Leading AI inference providers report up to a 10x reduction in cost per token when serving open-source large language models on NVIDIA's new Blackwell platform. Companies including Baseten, DeepInfra, and Together AI have achieved these savings through highly optimized inference stacks, lowering the barrier for startups to build and scale AI-powered products.
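Cost per token is hourly hardware cost divided by the tokens served in that hour. A faster platform can therefore cut per-token cost sharply even if it costs more to run, as long as throughput rises faster than price. The sketch below makes that arithmetic explicit. The `Deployment` class, `cost_reduction` helper, and every dollar and throughput figure are hypothetical and chosen only to illustrate the ratio; they are not published numbers from NVIDIA or any of the providers named above.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical inference deployment; all numbers are illustrative, not vendor data."""
    name: str
    hourly_cost_usd: float    # assumed cost to run the hardware for one hour
    tokens_per_second: float  # assumed sustained aggregate output throughput

    def cost_per_million_tokens(self) -> float:
        # Tokens served in one hour at the sustained throughput.
        tokens_per_hour = self.tokens_per_second * 3600
        # Spread the hourly cost over those tokens, scaled to one million tokens.
        return self.hourly_cost_usd / tokens_per_hour * 1_000_000


def cost_reduction(old: Deployment, new: Deployment) -> float:
    """Ratio of old to new cost per token (10.0 means a 10x reduction)."""
    return old.cost_per_million_tokens() / new.cost_per_million_tokens()


if __name__ == "__main__":
    # Hypothetical: the new system costs 2x as much per hour but serves 20x the tokens.
    old = Deployment("previous-gen", hourly_cost_usd=10.0, tokens_per_second=5_000.0)
    new = Deployment("new-gen", hourly_cost_usd=20.0, tokens_per_second=100_000.0)
    print(f"{old.name}: ${old.cost_per_million_tokens():.4f} per 1M tokens")
    print(f"{new.name}: ${new.cost_per_million_tokens():.4f} per 1M tokens")
    print(f"reduction: {cost_reduction(old, new):.1f}x")
```

With these made-up inputs, a 2x higher hourly cost paired with a 20x throughput gain yields a 10x lower cost per token. This is why software-level optimizations such as batching, quantization, and kernel tuning matter as much as raw hardware price.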
- The performance gains stem from the Blackwell architecture's multi-die design, which packs 208 billion transistors, compared with 80 billion in the previous-generation Hopper GPU. This is combined with a second-generation Transformer Engine and new 4-bit floating-point (FP4) inference support, which NVIDIA says doubles both the compute performance and the model size that can be supported while maintaining accuracy.
- The NVIDIA GB200 NVL72, a rack-scale system connecting 72 Blackwell GPUs, is designed to operate as a single massive GPU. NVIDIA claims this system delivers up to 30 times the LLM inference performance of a comparable system using the previous H1