LLM Inference Prices Fluctuate Wildly
Recent analysis reveals a vast price range for LLM inference. OpenAI's open-weight GPT-OSS-20B is the cheapest at $0.05 per million input tokens, while Grok-4 sits at the high end at $30 per million tokens. Anthropic has intensified competition by dramatically cutting prices for Claude Opus 4.6 and by positioning its Sonnet 4.6 model, which offers flagship capabilities, at a mid-tier price.
- The price for the same open-weight model can vary by as much as 10x between inference providers, indicating that inference is not yet a fully commoditized service. The variation reflects differences in hardware infrastructure, software optimizations, and business strategy, with some providers likely operating at a loss to gain market share.
- Output tokens are consistently priced higher than input tokens, often by a 4-8x multiplier, because generating text is more computationally intensive than processing a prompt. For example, Anthropic's Claude Opus 4.6 costs $5 per million input tokens but $25 per million output tokens, and a "fast mode" raises these to $30 and $150 respectively.
- Hardware specialization is a key driver of cost reduction, with Google's Tensor Processing Units (TPUs) offering up to 4x better price-performance for inference than NVIDIA's H100 GPUs on certain workloads. This has led companies such as Anthropic and Midjourney to migrate significant workloads to TPUs, with Midjourney cutting its monthly inference spending from $2.1 million to $700,000.
- Techniques like model quantization, which reduces the precision of the model's weights, and speculative decoding can dramatically lower operational costs and latency. Combining pruning, quantization, and knowledge distillation has been shown to reduce inference costs by as much as 5x while maintaining over 98% of the original model's accuracy.
- The "build vs. buy" decision for AI compute is a major strategic choice. Custom-built ASICs like TPUs offer maximum performance at scale but require years of development. Buying access to general-purpose GPUs offers more flexibility and a mature software ecosystem, which is why NVIDIA's CUDA remains a dominant platform.
- Venture capital is pouring into startups aiming to disrupt the AI chip market. Toronto-based Taalas recently raised $169 million to develop model-specific processors, and Ricursive Intelligence raised $335 million at a $4 billion valuation to use AI to design the next generation of AI chips.
- Despite a rapid decrease in cost per token, with the price of GPT-4-level performance falling by 40x per year, total spending on inference is expected to grow substantially. By 2030, inference is projected to consume 75% of all AI compute resources, creating a $255 billion market.
- Anthropic's pricing for Claude Opus 4.6 includes several tiers that can significantly alter the cost. Using the 1 million token context window (currently in beta) for a request over 200k tokens doubles the input price, and a "fast mode" carries a 6x premium. Stacking these features with US-only data residency can make the effective price per token more than 10x the standard rate.
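
To make the tier arithmetic in the last bullet concrete, here is a minimal Python sketch of a per-request cost estimate. It uses only the figures quoted above ($5 and $25 per million tokens, a 6x fast-mode premium, a doubled input price above 200k input tokens). The names `Pricing` and `request_cost` are hypothetical, the premiums are assumed to stack multiplicatively, and the data-residency multiplier is left as a placeholder of 1.0 because the text gives no figure for it. Any long-context premium on output tokens is not modeled.

```python
from dataclasses import dataclass

TOKENS_PER_MILLION = 1_000_000


@dataclass(frozen=True)
class Pricing:
    """Illustrative rate card; defaults are the Opus 4.6 figures quoted in the text."""
    input_per_m: float = 5.0             # USD per million input tokens (standard)
    output_per_m: float = 25.0           # USD per million output tokens (standard)
    fast_multiplier: float = 6.0         # "fast mode" premium from the text
    long_context_threshold: int = 200_000
    long_context_input_multiplier: float = 2.0
    residency_multiplier: float = 1.0    # ASSUMPTION: placeholder, no figure in the text


def request_cost(p: Pricing, input_tokens: int, output_tokens: int,
                 fast: bool = False, us_only: bool = False) -> float:
    """Estimate the USD cost of one request, stacking premiums multiplicatively."""
    in_rate, out_rate = p.input_per_m, p.output_per_m

    # Long-context premium: applies to the input price once the prompt exceeds 200k tokens.
    if input_tokens > p.long_context_threshold:
        in_rate *= p.long_context_input_multiplier

    # Fast mode scales both input and output rates ($5 -> $30, $25 -> $150).
    if fast:
        in_rate *= p.fast_multiplier
        out_rate *= p.fast_multiplier

    # Data residency: hypothetical multiplier, a no-op at the default of 1.0.
    if us_only:
        in_rate *= p.residency_multiplier
        out_rate *= p.residency_multiplier

    return (input_tokens * in_rate + output_tokens * out_rate) / TOKENS_PER_MILLION


if __name__ == "__main__":
    rates = Pricing()
    print(request_cost(rates, 100_000, 10_000))             # standard request
    print(request_cost(rates, 100_000, 10_000, fast=True))  # same request in fast mode
    print(request_cost(rates, 300_000, 0, fast=True))       # long context plus fast mode
```

Under these assumptions, a long-context request in fast mode pays 2 x 6 = 12 times the standard input rate before any residency premium, which is consistent with the "more than 10x" figure above.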