AMD's integrated stacks

AMD is gaining traction with tightly integrated CPU‑GPU systems that keep the processor and accelerator in the same stack to cut inference latency. The company’s MI325X GPU carries 288GB of HBM3E memory, up from 192GB on the MI300X, and AMD’s roadmap pairs future Instinct MI400 GPUs with EPYC “Venice” (Zen 6) CPUs in its “Helios” rack, an approach users say reduces round‑trip delays for agentic models and hyperscaler workloads.

In artificial intelligence servers, every trip between a general-purpose processor and a graphics chip adds delay. AMD is selling more systems that put those parts closer together and, in one design, into a single shared-memory package. (amd.com)

AMD’s MI300A is the clearest example: it combines 24 Zen 4 central processing unit cores, 228 CDNA 3 graphics compute units, and 128 gigabytes of high-bandwidth memory in one package with a single shared address space. AMD says that layout lets the central processor and graphics processor work from the same memory pool instead of copying data back and forth over a slower link. (amd.com)

AMD sells a separate line for bigger language-model jobs. The MI300X ships with 192 gigabytes of high-bandwidth memory, while the MI325X raises that to 288 gigabytes of HBM3E and 6 terabytes per second of memory bandwidth, according to AMD’s June 2, 2024 roadmap update and current product pages. (amd.com, amd.com) The basic tradeoff is simple: large models run faster when more of the model fits in fast on-package memory instead of spilling across a bus to host memory.

AMD’s CDNA architecture page says Infinity Fabric is built to tie GPU chiplets and stacked HBM memory together with coherent, high-throughput links inside a device and across multi-device platforms. (amd.com) That matters for inference, the stage where a trained model generates tokens for a user, because each extra memory hop can raise response time. AMD’s ROCm inference guides for vLLM are now written specifically for MI300X, MI325X, MI350X, and MI355X, and they focus on minimizing latency as well as maximizing throughput. (rocm.docs.amd.com, rocm.docs.amd.com)

AMD is also tying that accelerator story to its next server processor generation.
On April 14, 2025, the company said its next-generation EPYC chip, code-named Venice, was the first high-performance computing product brought up on TSMC’s 2-nanometer process, and AMD later described Venice as a Zen 6 part in its rack-scale AI roadmap. (amd.com, amd.com)

In that roadmap, AMD said its future “Helios” rack would combine Instinct MI400 graphics processors, EPYC Venice central processors, and Pensando Vulcano network interface chips. AMD said Venice is expected to offer up to 256 cores and up to 1.6 terabytes per second of memory bandwidth to keep data moving across the rack. (amd.com)

The company has kept pushing memory capacity higher in the generations after MI300. AMD’s current MI350 series pages list 288 gigabytes of HBM3E memory and 8 terabytes per second of bandwidth, showing that the design priority is still fitting larger models close to the compute engine. (amd.com, amd.com)

The pitch to cloud operators is not just raw speed; it is fewer bottlenecks between the chip doing orchestration and the chip doing matrix math. AMD’s integrated and tightly linked stacks are built around that one idea: move less data, over shorter distances, with more of the model sitting in fast memory from the start. (amd.com, amd.com)
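
The memory tradeoff described above comes down to simple arithmetic, and a rough back-of-the-envelope sketch in Python can make it concrete. It uses only the capacity and bandwidth figures quoted in this article. Everything else is an assumption for illustration: the `Accelerator` class, the function names, the 20 percent memory reserve for caches and activations, decimal gigabytes, and the simplification that generating one token reads every model weight from memory once (a dense model at batch size 1). Real latency depends on many other factors, so treat the output as a lower bound, not a benchmark.

```python
from dataclasses import dataclass
from typing import Optional

GB = 10**9   # decimal gigabyte (assumption; vendors may count differently)
TB = 10**12  # decimal terabyte


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical record holding the figures quoted in the article."""
    name: str
    hbm_gb: float
    bandwidth_tbps: Optional[float] = None  # None where the article gives no figure


# Capacities and bandwidths as quoted in the article, not independently verified.
ACCELERATORS = [
    Accelerator("MI300A", 128),
    Accelerator("MI300X", 192),
    Accelerator("MI325X", 288, 6.0),
    Accelerator("MI350 series", 288, 8.0),
]


def weight_bytes(params_billion: float, bytes_per_param: float) -> float:
    """Memory needed for the model weights alone."""
    return params_billion * 1e9 * bytes_per_param


def fits(acc: Accelerator, params_billion: float, bytes_per_param: float,
         reserve_fraction: float = 0.2) -> bool:
    """True if the weights fit after setting aside an assumed reserve
    for the key-value cache, activations, and runtime overhead."""
    usable = acc.hbm_gb * GB * (1.0 - reserve_fraction)
    return weight_bytes(params_billion, bytes_per_param) <= usable


def min_ms_per_token(acc: Accelerator, params_billion: float,
                     bytes_per_param: float) -> Optional[float]:
    """Bandwidth-bound floor on per-token time, assuming each token
    streams all weights from memory exactly once."""
    if acc.bandwidth_tbps is None:
        return None
    seconds = weight_bytes(params_billion, bytes_per_param) / (acc.bandwidth_tbps * TB)
    return seconds * 1e3


if __name__ == "__main__":
    # Example: a 70-billion-parameter model stored at 2 bytes per parameter.
    for acc in ACCELERATORS:
        floor = min_ms_per_token(acc, 70, 2)
        floor_text = "n/a" if floor is None else f"{floor:.1f} ms"
        print(f"{acc.name}: fits={fits(acc, 70, 2)}, per-token floor={floor_text}")
```

Under these assumptions, a 70-billion-parameter model at 2 bytes per parameter needs 140 gigabytes for weights, which exceeds the 128 gigabytes quoted for the MI300A but fits within the larger parts. Raising bandwidth from 6 to 8 terabytes per second lowers the per-token floor in direct proportion, which is the "more of the model in fast memory" argument expressed in numbers.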
