Inference now majority cloud spend
Social reporting shows inference has grown to about 55% of enterprise AI cloud spend, roughly $37.5 billion, shifting buyer focus toward caching, batching and quantization to cut costs. Practical measures like 50–75% savings from quantization and 40–80% throughput gains from batching are being highlighted as the principal levers for inference economics. (x.com)
Running a model after it is built now takes most of the money in enterprise artificial intelligence cloud budgets. Gartner said 55% of artificial-intelligence-optimized infrastructure-as-a-service spending will go to inference workloads in 2026. (gartner.com)

That shift comes as enterprise generative artificial intelligence spending keeps rising. Menlo Ventures said companies spent $37 billion on generative artificial intelligence in 2025, up from $11.5 billion in 2024, though its total excludes inference that earlier editions had counted. (menlovc.com)

Inference is the part customers actually use: every chatbot reply, search ranking, fraud check, or recommendation is a fresh run of a trained model. Gartner said inference workloads are becoming dominant because they power continuous, real-time applications rather than one-off training jobs. (gartner.com)

That is why buyers are talking less about training clusters and more about serving efficiency. Deloitte wrote in its 2026 technology trends report that falling per-unit inference costs have been outweighed by exploding usage, pushing companies to recalculate infrastructure plans around consumption. (deloitte.com)

The first lever is caching, which means reusing work the model has already done instead of recomputing the same prompt prefix. OpenAI says prompt caching is available on recent models and works when requests share an exact prefix, lowering latency and cost for long repeated prompts. (developers.openai.com) Anthropic offers the same basic idea on Claude: reuse a saved prompt prefix so repeated requests do not start from zero. Its documentation says prompt caching reduces processing time and costs for prompts with consistent elements. (platform.claude.com)

The second lever is batching, which means packing multiple requests together so one graphics processor stays busy instead of waiting between jobs.
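To make the batching idea concrete, here is a minimal Python sketch of simple fixed-size (static) batching. It is an illustration under stated assumptions, not any serving framework's scheduler: the names `make_batches` and `forward_passes` and the batch size of 4 are hypothetical, and production servers use more sophisticated scheduling than this.

```python
from typing import List


def make_batches(requests: List[str], max_batch_size: int) -> List[List[str]]:
    """Group pending requests into batches of at most max_batch_size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        requests[i:i + max_batch_size]
        for i in range(0, len(requests), max_batch_size)
    ]


def forward_passes(num_requests: int, max_batch_size: int) -> int:
    """Number of model invocations needed (ceiling division)."""
    return -(-num_requests // max_batch_size)


# Hypothetical workload: 10 queued prompts, served 4 at a time.
pending = [f"prompt-{i}" for i in range(10)]
batches = make_batches(pending, max_batch_size=4)

# One invocation per batch instead of one per request.
unbatched_passes = forward_passes(len(pending), 1)
batched_passes = forward_passes(len(pending), 4)
print(len(batches), unbatched_passes, batched_passes)
```

In this toy setup the ten prompts need three model invocations rather than ten. The real throughput gain depends on the hardware, the model, and the sequence lengths, none of which this sketch models.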
vLLM points users to continuous batching as a core throughput feature, and its documentation cites work showing large throughput gains while cutting median latency. (docs.vllm.ai)

The third lever is quantization, which means storing model numbers in fewer bits, like shrinking a file so it fits in less memory. vLLM says quantization reduces model precision to cut memory footprint, and NVIDIA’s TensorRT-LLM lists quantization as a built-in optimization for efficient inference on NVIDIA graphics processors. (github.com) (docs.nvidia.com)

Those techniques are moving from engineering detail to budget line because inference does not end when training ends. Gartner projects inference’s share of artificial-intelligence-optimized infrastructure-as-a-service spending will rise past 65% by 2029, which means the cost fight is moving to every token served, not just every model trained. (gartner.com)
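The memory effect of the quantization lever described above can be checked with back-of-envelope arithmetic. The Python sketch below is a rough illustration, not a vendor formula: the 7-billion-parameter model size and the helper name `weight_memory_gb` are assumptions, and the calculation covers weights only, ignoring activations, caches, and any per-group scaling metadata that real quantization formats store.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical model size, chosen only for illustration.
params = 7e9

fp16_gb = weight_memory_gb(params, 16)  # 16-bit baseline
int8_gb = weight_memory_gb(params, 8)   # 8-bit quantized weights
int4_gb = weight_memory_gb(params, 4)   # 4-bit quantized weights

# Fraction of weight memory saved relative to the 16-bit baseline.
savings_int8 = 1 - int8_gb / fp16_gb
savings_int4 = 1 - int4_gb / fp16_gb

print(f"16-bit: {fp16_gb:.1f} GB")
print(f" 8-bit: {int8_gb:.1f} GB ({savings_int8:.0%} smaller)")
print(f" 4-bit: {int4_gb:.1f} GB ({savings_int4:.0%} smaller)")
```

Under these assumptions, halving or quartering the bits per weight cuts weight memory by 50% or 75%. Whether that translates into the same reduction in serving cost depends on accuracy requirements and on how the freed memory is used.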