AMD APU memory causes 2.2x slowdown

- AMD APU users traced a big llama.cpp slowdown to HIP mapped host memory, where tensors landed in GTT instead of faster local VRAM paths.
- One reported decode run jumped from roughly 45 to 100 ms per token, and ROCm docs explain mapped host memory is direct-access host RAM.
- It matters because unified-memory APUs blur “GPU memory” semantics, so one allocator flag can quietly wreck inference latency.

AMD’s APU memory story is a good example of how “shared memory” can sound simple and behave anything but. On paper, an APU gives the CPU and GPU access to the same system RAM. In practice, the exact allocation path still decides whether the GPU sees something like local working memory or something more like zero-copy host memory. That distinction just surfaced in developer testing around llama.cpp on ROCm, where mapped host allocations were tied to a big decode slowdown on AMD APUs. (rocm.docs.amd.com)

### What actually tripped people up?

The trigger was `hipHostMalloc` with mapped host memory semantics. ROCm’s own docs say `hipHostMallocMapped` allocates pinned host memory and maps it into the device address space, so the GPU can access it directly without a normal copy step. That sounds great, but direct access is not the same thing as fast, cache-friendly local GPU memory. (rocm.docs.amd.com)

### Why can “shared memory” still be slow?

Because “shared” describes addressability, not performance. ROCm’s programming material has been pretty blunt about this for years: zero-copy host memory avoids copies, but each access can traverse the CPU/GPU interconnect and be far slower than l[…]d behavior. (rocm.docs.amd.com) (github.com)

### Where does L2 come into this?

The important bit is cache policy. If a buffer is treated as fine-grained or coherent host memory, the GPU may not get normal cache behavior on it. ROCm’s docs spell out the tradeoff: fine-grained coherence gives CPU/GPU visibility while kernels run, but many A[…]-coherent behavior. (github.com)

### Why does decode get hit so hard?

LLM decode is brutally sensitive to memory latency. During prompt processing, the GPU can amortize a lot of memory traffic across many tokens. During decode, it is generating one token at a time, repeatedly touching weights and KV data. So if those reads come from a slo[…]he user report here: roughly 45 ms/token turning into about 100 ms/token, or around a 2.2x slowdown. (github.com)
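
To put the reported numbers in more familiar units, here is a minimal arithmetic sketch. The two latencies are the approximate figures from the user report; the 500-token reply length and the helper names (`tokens_per_second`, `slowdown`, `extra_seconds`) are assumptions made up for illustration, not anything taken from llama.cpp or ROCm.

```python
# Approximate per-token decode latencies quoted in the user report.
BASELINE_MS_PER_TOKEN = 45.0
MAPPED_MS_PER_TOKEN = 100.0


def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token decode latency (ms) into tokens per second."""
    return 1000.0 / ms_per_token


def slowdown(baseline_ms: float, degraded_ms: float) -> float:
    """Ratio of degraded latency to baseline latency."""
    return degraded_ms / baseline_ms


def extra_seconds(n_tokens: int, baseline_ms: float, degraded_ms: float) -> float:
    """Additional wall-clock time to decode n_tokens at the degraded latency."""
    return n_tokens * (degraded_ms - baseline_ms) / 1000.0


if __name__ == "__main__":
    print(f"baseline: {tokens_per_second(BASELINE_MS_PER_TOKEN):.1f} tok/s")
    print(f"mapped:   {tokens_per_second(MAPPED_MS_PER_TOKEN):.1f} tok/s")
    print(f"slowdown: {slowdown(BASELINE_MS_PER_TOKEN, MAPPED_MS_PER_TOKEN):.2f}x")
    # Hypothetical 500-token reply, chosen only to make the gap concrete.
    extra = extra_seconds(500, BASELINE_MS_PER_TOKEN, MAPPED_MS_PER_TOKEN)
    print(f"extra time for 500 tokens: {extra:.1f} s")
```

In throughput terms that is a drop from roughly 22 tokens per second to 10, which a user would notice in interactive use even if prompt-processing benchmarks barely move.
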
The same llama.cpp APU discussion from 2024 already showed that changing HIP memory behavior could swing prompt and eval timings a lot on Ryzen 7940HS hardware. (github.com)

### Is this a ROCm bug or expected behavior?

Basically, both. The low-level behavior is not mysterious: mapped host memory is documented as host RAM the GPU can directly access, and ROCm has long warned that zero-copy can be much slower than local memory. But developers are clearly running into a usability gap on APUs, because unified-memory hardware make[…]tically. (github.com) Recent ROCm issues on Ryzen AI APUs ask for better `hipMallocManaged` support or a performant GTT path precisely because current behavior can strand workloads on the wrong side of that tradeoff. (rocm.docs.amd.com)

### Why are APUs especially confusing here?

Because the hardware is unified, but the software model still exposes old distinctions: VRAM, GTT, mapped host memory, managed memory, coarse-grain versus fine-grain. On a discrete GPU, developers already expect host memory to be second-class. On an[…]lity, and coherence rules. (rocm.docs.amd.com) (github.com)

### So what’s the practical lesson?

Don’t treat allocator choice as plumbing. On AMD APUs, `hipMalloc`, `hipMallocManaged`, and `hipHostMallocMapped` can imply very different cache and coherence behavior, even when all of them ultimately touch the same physical RAM pool. For inference stacks, that means memory placement needs profiling, especially of decode latency, not just throughput. (rocm.docs.amd.com)

### Bottom line

The headline isn’t just that one AMD APU setup got 2.2x slower. It’s that unified-memory systems still have “fast shared memory” and “slow shared memory”, and one innocent-looking HIP allocation path can be the difference.
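
For anyone who wants to act on the profiling advice above, a per-token latency harness might look something like the sketch below. It is a generic illustration, not llama.cpp's own benchmarking code: `time_decode_steps`, `summarize`, and `fake_step` are hypothetical names, and `step` stands in for whatever call produces one token in your stack.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_decode_steps(step: Callable[[], None], n_tokens: int, warmup: int = 3) -> List[float]:
    """Time n_tokens calls to step() and return per-call latencies in ms.

    step() must block until the token is actually produced; if the GPU
    work is asynchronous, synchronize inside step() or the timings will
    only measure launch overhead.
    """
    for _ in range(warmup):  # discard first-call and cache warm-up effects
        step()
    latencies_ms = []
    for _ in range(n_tokens):
        start = time.perf_counter()
        step()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Report median and tail latency alongside average throughput."""
    ordered = sorted(latencies_ms)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    total_ms = sum(ordered)
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
        "tokens_per_s": 1000.0 * len(ordered) / total_ms if total_ms > 0 else float("inf"),
    }


if __name__ == "__main__":
    def fake_step() -> None:
        # Stand-in for a real single-token decode call.
        time.sleep(0.001)

    print(summarize(time_decode_steps(fake_step, n_tokens=20)))
```

Running the same harness once per allocation strategy and comparing the median and p95 figures, rather than a single tokens-per-second average, makes a placement-induced latency regression harder to miss.
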
