Nvidia Claims 10x Reduction in AI Inference Costs

Nvidia says its latest platform update cuts the cost of AI inference tenfold. The company attributes the improvement to a combination of new GPU hardware and software stack optimizations, including advances in TensorRT-LLM and vLLM. Early adopters have reportedly seen lower cost per token and higher throughput.

- The cost reduction is largely attributed to the new Blackwell GPU architecture, which, when combined with the NVFP4 data format, can decrease the cost per million tokens from 20 cents on the prior Hopper platform to 5 cents.
- For agentic AI and coding assistant workloads, Blackwell Ultra systems can cut the cost per token by up to 35 times and increase throughput per megawatt by up to 50 times compared to the Hopper platform.
- The upcoming "Rubin" platform is projected to offer another 10x performance increase and an additional 10x cost reduction over the Blackwell generation, further lowering the barrier to scaling AI services.
- The hardware is supported by a suite of software, including NVIDIA NIM inference microservices and the NVIDIA AI Enterprise platform, which are designed to streamline the deployment of optimized AI models in production environments.
- A key software component is TensorRT-LLM, an open-source library that provides optimizations such as in-flight batching, paged key-value caching, and quantization to FP8 and FP4 for efficient inference on NVIDIA GPUs.
- For developers, a new beta feature in TensorRT-LLM called AutoDeploy automatically compiles PyTorch models into optimized inference graphs, reducing the manual effort and time required for deployment.
- Leading inference providers such as Baseten, DeepInfra, Fireworks AI, and Together AI have already adopted the Blackwell platform, reporting up to a 10x reduction in their cost per token.
- In a recent major partnership, Meta will deploy millions of NVIDIA's Blackwell and upcoming Rubin GPUs, and will also adopt NVIDIA's Grace and Vera CPUs and Spectrum-X Ethernet networking for its AI infrastructure.
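
To put the cost figures above in perspective, here is a minimal Python sketch of how a per-token serving cost might be derived and how two costs compare. Only the 20-cent and 5-cent figures come from the briefing. The function names, the GPU hourly price, and the throughput value are hypothetical, chosen purely for illustration, and are not Nvidia's methodology.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """USD to generate one million tokens on one GPU at a steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


def reduction_factor(old_cost: float, new_cost: float) -> float:
    """How many times cheaper new_cost is than old_cost (unit-independent)."""
    return old_cost / new_cost


# Figures quoted in the briefing, in cents: Hopper vs. Blackwell with NVFP4.
hopper_cents = 20
blackwell_nvfp4_cents = 5
print(reduction_factor(hopper_cents, blackwell_nvfp4_cents))  # 4.0

# Hypothetical inputs (assumptions, not from the briefing): a GPU rented at
# $3.60 per hour that sustains 5,000 tokens per second.
print(cost_per_million_tokens(gpu_hourly_usd=3.60, tokens_per_second=5000))
```

The quoted Hopper-to-NVFP4 figures work out to a 4x reduction. Because cost per token is hourly cost divided by throughput, a larger reduction requires either higher throughput per GPU or a lower hourly cost.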

