AMD MI350X posts inference benchmarks
- AMD’s MI350X is starting to show up in public inference results, with AMD and developers posting early numbers for GPT-OSS and other large-model serving.
- The headline detail is memory: 288GB of HBM3E per GPU and 8TB/s of bandwidth, plus AMD-published MI355X scaling results on GPT-OSS-120B. (amd.com)
- If those gains hold outside vendor demos, AMD has a real opening in memory-bound inference, where fitting bigger models cleanly matters most. (amd.com)

Inference GPUs are turning into a memory fight as much as a raw-compute fight. That is the backdrop for the MI350X story. The interesting part is not just that AMD has a new accelerator. It’s that public benchmark crumbs are finally showing where this thing might matter most: large-model inference. AMD’s product pages and ROCm materials now give enough detail to sketch the bet. (amd.com)

The MI350X is AMD’s 4th-gen CDNA data-center GPU. The important specs are simple: 288GB of HBM3E memory per GPU, 8TB/s of memory bandwidth, and support for lower-precision AI datatypes like MXFP6 and MXFP4. In plain English, AMD is pushing this as a part built for modern inference and training, not just classic HPC. (amd.com)

### Why does the memory number matter so much?

Because inference breaks in annoying ways when a model does not fit on one GPU: once it is split across devices, latency rises and utilization gets worse. A 288GB GPU gives operators more room to keep larger models or longer contexts resident without heroic sharding tricks. AMD’s platform docs also show the eight-GPU MI350X system reaching about 2.3TB of total HBM, which is the kind of number hyperscalers notice immediately. (instinct.docs.amd.com)

### Did AMD actually post inference benchmarks?

Yes, but mostly through AMD ROCm performance pages and blog posts centered on the MI355X, the sibling part in the same family, rather than a neat single MI350X launch chart for every workload. AMD’s ROCm performance hub now aggregates inference results, and a recent ROCm blog showed MI355X results on GPT-OSS-120B, DeepSeek-R1, Qwen3-235B, and Llama-3.3-70B using vLLM. That matters because it moves the conversation from spec-sheet theory to serving throughput under a defined stack. (amd.com)

### What about the 62,000 tokens-per-second claim?

I could verify the surrounding ingredients, but I could not find that exact MI350X figure in a primary source I trust enough to state it as settled.
AMD does have official day-0 GPT-OSS support materials, and independent AMD-focused projects have shown strong GPT-OSS throughput on earlier MI250 hardware. But the specific “62,000 tokens/sec on GPT-OSS 20B” number looks like it is circulating through posts and repos rather than a clean AMD benchmark page I can point to directly. So treat that number as plausible but still provisional. (rocm.blogs.amd.com)

### Why does GPT-OSS keep coming up here?

Because GPT-OSS is a useful test case for agentic and reasoning-style workloads that developers actually want to serve. OpenAI’s repo positions gpt-oss-20b and gpt-oss-120b as open-weight models for reasoning and agentic tasks, and AMD made a point of supporting them on ROCm from day one. That pairing is strategic: AMD wants to show that new open models land on its stack quickly, not months later. (github.com)

### Is this really ab[...]

[...] ROCm materials explicitly compare MI355X against Nvidia’s B200 on several inference workloads and claim wins at scale on some of them. Even where MI350X itself is not the exact SKU in the chart, the family message is obvious: AMD thinks memory-heavy, high-concurrency inference is the wedge. (rocm.blogs.amd.com)

### What’s the catch?

Software maturity still [...] but production buyers care about kernel quality, model coverage, deployment tooling, and how much hand-tuning it takes to hit the pretty numbers. AMD is clearly improving here, and ROCm docs now include benchmark recipes and model-specific guidance, but the burden of proof is still on repeatable real-world deployments. (rocm.docs.amd.com)

The hard evidence today is strong memory density, official ROCm support, and growing inference data from AMD’s MI350 family. The softer claim, huge GPT-OSS 20B throughput on MI350X itself, may turn out to be real, but it still needs cleaner validation.
If that validation arrives, AMD’s pitch gets very sharp very fast: fewer memory compromises, more usable open-model inference, and a much stronger case for buyers to split accelerator spend. (amd.com)
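
For readers who want to sanity-check the memory argument, here is a rough back-of-the-envelope sketch in Python. The 288GB-per-GPU and eight-GPU figures come from the specs quoted above. Everything else is an assumption for illustration: the parameter counts are nominal values read off model names, the function names are hypothetical, and the math covers weights only, ignoring KV cache, activations, runtime overhead, and the per-block scale data that formats like MXFP4 carry.

```python
import math

# Figures quoted in the article.
HBM_PER_GPU_GB = 288   # HBM3E capacity per MI350X
GPUS_PER_NODE = 8      # eight-GPU platform

# Total HBM in one node, in decimal terabytes.
NODE_HBM_TB = HBM_PER_GPU_GB * GPUS_PER_NODE / 1000


def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-only footprint in decimal GB.

    Ignores KV cache, activations, and any quantization scale overhead,
    so real deployments need more than this.
    """
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


def min_gpus_for_weights(params_billion: float, bits_per_weight: float,
                         hbm_gb: float = HBM_PER_GPU_GB) -> int:
    """Smallest GPU count whose combined HBM holds the weights alone."""
    return math.ceil(weight_footprint_gb(params_billion, bits_per_weight) / hbm_gb)


if __name__ == "__main__":
    print(f"Node HBM: {NODE_HBM_TB:.3f} TB")
    # Nominal sizes taken from model names; assumptions, not measured values.
    for label, params in [("70B-class", 70), ("120B-class", 120), ("235B-class", 235)]:
        for precision, bits in [("16-bit", 16), ("4-bit nominal", 4)]:
            gb = weight_footprint_gb(params, bits)
            gpus = min_gpus_for_weights(params, bits)
            print(f"{label:>11} @ {precision:<13}: {gb:7.1f} GB weights, >= {gpus} GPU(s)")
```

Eight GPUs at 288GB each works out to 2,304GB, which matches the roughly 2.3TB platform figure. The per-model rows are only a lower bound: long contexts and high concurrency can make the KV cache rival the weights, which is exactly where extra HBM per GPU starts to pay off.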