GPU microarchitecture signals
Discussion is growing about specialized inference silicon and substrate-agnostic model strategies. Examples mentioned include companies pushing SRAM-heavy chips and inference-optimized designs, and analysts say the competitive moat may shift toward data and fine-tuning rather than raw substrate advantage (x.com) (x.com) (x.com).
A graphics processing unit (GPU) is built to train models and draw graphics; inference chips are built to answer prompts quickly and cheaply once the model is already trained. Microsoft, Google, Amazon, Groq and others are now describing hardware around that second job. (blogs.microsoft.com)
Microsoft said on January 26, 2026 that its Maia 200 was “built for inference,” with 216 gigabytes of HBM3e memory, 7 terabytes per second of bandwidth and 272 megabytes of on-chip static random-access memory, or SRAM, to keep data close to the compute units. The company also said Maia 200 adds specialized data-movement engines and a network built on standard Ethernet. (blogs.microsoft.com)
Groq is making a similar pitch from a different architecture. NVIDIA’s product page for Groq’s latest LPU says each accelerator has 500 megabytes of SRAM and 150 terabytes per second of SRAM bandwidth, while Groq said in August 2025 that its design is aimed at low-latency language-model inference rather than training. (nvidia.com) (groq.com)
Google and Amazon are also framing their custom silicon around inference economics. Google said its Ironwood Tensor Processing Unit is designed for “high-throughput, low-latency inference,” and Amazon says Trainium 1, 2 and 3 target both training and inference at lower cost. (cloud.google.com) (aws.amazon.com)
The hardware argument centers on memory, not just math. Large language models spend much of inference moving weights and cached tokens around, so chipmakers are emphasizing SRAM, memory bandwidth, interconnects and software stacks that keep expensive compute units busy. (blogs.microsoft.com) (groq.com) (cloud.google.com)
That is also why “substrate-agnostic” strategies, meaning approaches that do not depend on any one chip, are getting more attention. Microsoft said the Maia software kit includes PyTorch integration, a Triton compiler and low-level controls for porting models across heterogeneous accelerators, while OpenAI’s developer guidance now groups fine-tuning and distillation under model optimization rather than under any single hardware target. (blogs.microsoft.com) (developers.openai.com)
Distillation is the clearest example of that shift. OpenAI launched a model-distillation workflow on October 1, 2024 so developers could use a larger “teacher” model to train a smaller, cheaper “student” model, and a February 2026 benchmark paper found that distilled models improved the performance-to-compute curve relative to non-distilled baselines. (openai.com) (arxiv.org)
Researchers have been arguing for years that data can beat extra compute under a fixed budget. A 2023 paper comparing distillation with human annotation found that, in some settings, spending on more labeled fine-tuning data could be more cost-efficient than using extra GPU time to distill a compact model. (arxiv.org)
None of this means GPUs are disappearing. NVIDIA’s March 16, 2026 Rubin POD announcement still centers on rack-scale systems for “high throughput” and “low-latency inference,” and AMD said in March 2026 that inference performance depends on workload, latency targets and cost-per-token, not one benchmark. (developer.nvidia.com) (amd.com)
Cerebras shows the other side of the market: bigger, more specialized silicon that tries to remove bottlenecks by changing the chip itself. Its wafer-scale engine spans an entire wafer, and the company says the WSE-3 packs 4 trillion transistors and scales to 2,048 nodes for large-model workloads. (cerebras.ai)
The fight, then, is no longer just about whose chip is fastest. It is increasingly about who can pair the right hardware with the best routing, caching, distillation, fine-tuning data and software so a model is cheap enough to serve at scale. (openai.com) (developers.openai.com) (amd.com)
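As a rough illustration of why the hardware argument centers on memory rather than math, the sketch below turns the quoted vendor figures into back-of-envelope bounds. It is a simplification under stated assumptions, not a benchmark: it assumes that generating one token for a single request streams every model weight from memory once, it ignores cached tokens, compute limits and interconnect overhead, and it uses a hypothetical 70-billion-parameter model stored at one byte per parameter. The function names are illustrative; only the 7 terabytes per second and 500 megabytes figures come from the claims quoted above.

```python
import math

TB = 1e12  # decimal terabyte, in bytes (assumption: vendors quote decimal units)
MB = 1e6   # decimal megabyte, in bytes


def decode_tokens_per_second(bandwidth_bytes_per_s, params, bytes_per_param):
    """Upper bound on single-request tokens/s if decoding is memory-bound.

    Assumes each generated token requires reading all weights once.
    """
    weight_bytes = params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


def chips_to_hold_weights(params, bytes_per_param, sram_bytes_per_chip):
    """Minimum number of SRAM-only accelerators needed just to store the weights."""
    weight_bytes = params * bytes_per_param
    return math.ceil(weight_bytes / sram_bytes_per_chip)


# Hypothetical model, not taken from the article: 70e9 parameters at 8 bits each.
PARAMS = 70e9
BYTES_PER_PARAM = 1.0

# 7 TB/s is the HBM bandwidth Microsoft quoted for Maia 200.
hbm_bound = decode_tokens_per_second(7 * TB, PARAMS, BYTES_PER_PARAM)

# 500 MB is the per-accelerator SRAM figure quoted for Groq's LPU.
sram_chips = chips_to_hold_weights(PARAMS, BYTES_PER_PARAM, 500 * MB)

print(f"HBM-bound decode ceiling: {hbm_bound:.0f} tokens/s per request")
print(f"SRAM-only accelerators needed for the weights: {sram_chips}")
```

Under these assumptions, the HBM bandwidth caps a single request at about 100 tokens per second, while the weights alone would have to be spread across about 140 SRAM-only accelerators. Batching several requests amortizes each weight read across more tokens, which is one reason measured throughput can differ widely from this bound.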