CPUs resurge for AI inference
- Intel and SambaNova turned the “CPUs are back” rumor into an actual product on April 8, unveiling a split inference stack for agentic AI.
- Their design gives GPUs the prompt-heavy prefill work, SambaNova RDUs the token-by-token decode path, and Xeon 6 CPUs the host and tool-execution role.
- That matters because inference is fragmenting by bottleneck now: memory, latency, and I/O matter as much as raw FLOPS.

CPUs are not taking AI back from GPUs. That’s the wrong frame. What’s happening is narrower and more interesting: inference is getting split into parts, and CPUs are suddenly useful again in the parts accelerators don’t handle cleanly. The clearest signal came on April 8, when Intel and SambaNova announced a heterogeneous inference blueprint: GPUs for prefill, SambaNova RDUs for decode, and Intel Xeon 6 CPUs as both host and “action” CPUs for agentic workloads, with availability planned for H2 2026. (sambanova.ai)

### What changed?

For the last two years, the default assumption was simple: if you want serious LLM inference, you buy more GPUs. But production serving has exposed a messier reality. An LLM request is not one uniform workload. It has a prompt-processing phase called prefill and a token-generation phase called decode, and serving stacks are now explicitly separating them. (docs.vllm.ai)

### Why does that help CPUs?

Because CPUs are not competing head-on with GPUs on dense matrix math. They are winning back the orchestration layer around inference: request routing, tool use, memory movement, networking, storage access, and the general-purpose work that shows up when “chatbot” turns into “agent.” Intel and SambaNova are saying exactly that: the Xeon CPU takes the host and tool-execution role while the accelerator handles the model-specific hot path. (sambanova.ai)

### What is prefill, really?

Prefill is the part where the model reads the whole prompt and builds the KV cache it will use later. That stage is heavy on parallel compute. It likes hardware that can chew through a lot of matrix math at once. That is why GPUs are still the obvious fit there, and why disaggregated serving systems often keep prefill on accelerator-heavy nodes. (docs.vllm.ai)

### And decode?

Decode is the opposite shape. Once the prompt is ingested, the model generates one token at a time while repeatedly reading from memory. That makes the phase much more sensitive to memory bandwidth, cache behavior, and latency consistency.
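
To see why prefill and decode have opposite shapes, here is a rough back-of-envelope sketch, not a benchmark. It assumes the common approximation that a dense model spends about 2 FLOPs per parameter per token and streams its weights from memory once per forward pass; it ignores KV-cache and activation traffic. The function names and the accelerator figures (400 TFLOP/s, 2 TB/s) are hypothetical and chosen only for illustration.

```python
def arithmetic_intensity(tokens_per_weight_read: int,
                         bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs performed per byte of weights read.

    Assumption: ~2 FLOPs (multiply + add) per parameter per token, with the
    weights read once per forward pass, so the parameter count cancels out.
    KV-cache and activation traffic are ignored.
    """
    return (2.0 * tokens_per_weight_read) / bytes_per_param


def bound_by(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Roofline-style check: compare intensity with the hardware ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs the chip can do per byte moved
    return "compute" if intensity >= ridge else "memory bandwidth"


# Hypothetical accelerator, for illustration only.
PEAK = 400e12  # 400 TFLOP/s
BW = 2e12      # 2 TB/s, giving a ridge point of 200 FLOPs/byte

# Prefill: one pass over the weights serves every prompt token (here 2048).
prefill = arithmetic_intensity(2048)
# Decode at batch size 1: one pass over the weights yields a single token.
decode = arithmetic_intensity(1)

print(prefill, bound_by(prefill, PEAK, BW))  # 2048.0 compute
print(decode, bound_by(decode, PEAK, BW))    # 1.0 memory bandwidth
```

Under these assumptions, batching more requests into each decode step raises the intensity, which is one reason serving systems treat the two phases as separate tuning problems.
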
NVIDIA’s own disaggregated inference guidance calls decode memory-bandwidth-bound, and llm-d and vLLM both support separating the two phases for better latency control. (developer.nvidia.com)

### So are CPUs doing decode too?

Sometimes, but the bigger story is hybridization, not CPU-only triumphalism. Intel has been pushing prefill/decode decoupling in its Gaudi software stack, and SambaNova’s April design pushes decode onto its RDU while leaving CPUs to run the surrounding system logic. In other words, the industry is not converging on “one best chip.” It is converging on “match each phase to its bottleneck.” (community.intel.com)

### Why is software suddenly talking this way?

Because the serving stack has matured enough to expose the tradeoffs. vLLM’s disaggregated prefill feature exists to tune time-to-first-token separately from inter-token latency and to reduce tail latency. Once software makes the split visible, hardware specialization follows fast. (docs.vllm.ai)

### Is this bigger than Intel and SambaNova?

Yes. Google’s new eighth-generation TPU family now separates training and inference more explicitly, with TPU 8i positioned around inference and the “memory wall,” not just raw compute. That does not prove CPUs are taking over, but it does show the whole market is redesigning around inference bottlenecks instead of training-era bragging rights. (cloud.google.com)

### What’s the bottom line?

The comeback is real, but it is not a CPU revenge story. It is an architecture story. Inference is turning into a pipeline, not a monolith, and once that happens, CPUs matter again because moving data, calling tools, and keeping latency predictable are first-class jobs, not leftovers. (newsroom.intel.com)