Avoid CPU‑GPU roundtrips with unified memory
- Apple’s Metal docs and WWDC guidance make the real point clear: unified memory removes copies on Apple silicon, but not CPU–GPU stalls.
- The key trick is to keep work on one side longer: fuse passes, prefer GPU-private resources when the CPU doesn’t need access, and triple-buffer shared data.
- That matters for iPhone, iPad, and Vision Pro apps because unified memory helps, but synchronization, allocation churn, and bandwidth pressure still cost frames.

Unified memory sounds like a free pass. The CPU and GPU share the same physical memory on Apple silicon, so the old “copy data across the bus” story gets much smaller. But the performance problem doesn’t disappear; it changes shape. The expensive part often becomes roundtrips, meaning handoffs, stalls, and cache-unfriendly resource churn, not raw copying. Apple’s Metal docs and talks keep coming back to the same idea: keep data where the work is, and avoid bouncing between CPU and GPU unless you really have to. (developer.apple.com)

### What does unified memory actually fix?

On Apple GPUs, CPU and GPU can access the same system memory. In Metal, that usually means `MTLStorageMode.shared` is the default on Apple silicon, and you don’t need the explicit managed-resource sync dance that older discrete-GPU Macs used. That is a real win: fewer copies, simpler code, less bookkeeping. But sh[…]timelines, and preferred memory paths. (developer.apple.com)

### So why do roundtrips still hurt?

Because every time the CPU waits for the GPU, or the GPU waits for the CPU, you lose parallelism. That’s the real tax. Apple’s synchronization sample is blunt about it: the goal is to avoid stalls by using multiple instances of a resource so both processors can keep working. If your pipeline does GPU pass, CPU readback, CPU tweak, GPU p[…]ce, not dependency. (developer.apple.com)

### What should stay on the GPU?

Anything the CPU doesn’t truly need right now. Apple’s storage-mode guidance says shared memory is great when both processors need access, but private resources are optimized for GPU-only use. On unified-memory systems, `private` still lives in system memory, but it lets Metal place and optimize that resource for GPU access. So if an interme[…]posing it to the CPU just because you can. (developer.apple.com)

### Why does pass fusion help so much?

Because the fastest handoff is the one you never make.
If two compute stages can run back-to-back on the GPU, or if image processing can stay inside one GPU-driven pipeline, you cut command overhead, synchronization points, and temporary resource traffic. Apple’s recent Metal guidance even frames this as “unify your co[…] (developer.apple.com 1) (developer.apple.com 2)

### What about buffers and allocation churn?

That’s the other half of the story. Even with unified memory, creating and discarding buffers aggressively can burn time and inflate memory pressure. Apple pushes developers toward deliberate resource management: pick the right storage mode, use heaps when appropriate, and reuse resources instead of constantly reallocating them. The catch is that memory[…]app keeps touching large shared buffers from both CPU and GPU, you can end up fighting over the same memory instead of moving work forward. (developer.apple.com)

### How do you avoid CPU–GPU stalls in practice?

Triple buffering is the classic answer. Apple’s sample uses multiple copies of dynamic resources plus a semaphore so the CPU can prepare the next frame while the GPU renders the current one. That pattern matters more than ever on Apple silicon. Don’t mutate a buffer the GPU is still reading. Don’t force completion just to inspect intermediate results. Give each side enough independent work that they overlap cleanly. (developer.apple.com)

### Why is this especially relevant to Vision Pro and iPhone apps?

Because those apps live on tight frame budgets and shared system resources. visionOS, iOS, and iPadOS all run on Apple silicon with unified memory, and Metal is the main path to high-performance graphics and compute there. That makes the advice directly practical: less readback, fewer sync points, more GPU-local processing, and steadier resource reuse. (developer.apple.com)

### Bottom line

Unified memory is a shortcut around copying, not a shortcut around architecture.
The big wins come from avoiding CPU–GPU handoffs, not from assuming shared memory made them cheap. If you keep intermediate work on the GPU and stop reallocating or synchronizing so often, frame time usually gets better fast.
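
To make the triple-buffering and GPU-private advice concrete, here is a minimal Swift sketch. It is not Apple's sample code: the `FrameRing` class, the `FrameUniforms` layout, and the `encodeFrame` helper are hypothetical names invented for illustration, and the 1 MiB intermediate size is an arbitrary assumption. It shows a ring of three shared buffers guarded by a semaphore, plus one private buffer that is allocated once and reused.

```swift
import Foundation
import Metal

// Hypothetical per-frame data; the layout is an assumption for illustration.
struct FrameUniforms {
    var time: Float
    var frameIndex: UInt32
}

final class FrameRing {
    static let maxFramesInFlight = 3

    let queue: MTLCommandQueue
    // One shared buffer per in-flight frame, so the CPU never writes
    // a buffer the GPU may still be reading.
    let uniformBuffers: [MTLBuffer]
    // GPU-only scratch space, allocated once and reused every frame.
    let intermediate: MTLBuffer

    private let inFlight = DispatchSemaphore(value: FrameRing.maxFramesInFlight)
    private var slot = 0
    private var frameNumber: UInt32 = 0

    init?(device: MTLDevice, intermediateLength: Int) {
        guard let queue = device.makeCommandQueue(),
              let intermediate = device.makeBuffer(length: intermediateLength,
                                                   options: .storageModePrivate)
        else { return nil }

        var buffers: [MTLBuffer] = []
        for _ in 0..<FrameRing.maxFramesInFlight {
            guard let buffer = device.makeBuffer(length: MemoryLayout<FrameUniforms>.stride,
                                                 options: .storageModeShared)
            else { return nil }
            buffers.append(buffer)
        }

        self.queue = queue
        self.intermediate = intermediate
        self.uniformBuffers = buffers
    }

    // Writes this frame's uniforms, lets the caller encode GPU work, then commits.
    func encodeFrame(time: Float,
                     encode: (MTLCommandBuffer, MTLBuffer, MTLBuffer) -> Void) {
        // Blocks only when three frames are already in flight.
        inFlight.wait()
        guard let commandBuffer = queue.makeCommandBuffer() else {
            inFlight.signal()
            return
        }

        let uniforms = uniformBuffers[slot]
        uniforms.contents().storeBytes(
            of: FrameUniforms(time: time, frameIndex: frameNumber),
            as: FrameUniforms.self)

        encode(commandBuffer, uniforms, intermediate)

        // Release the slot when the GPU finishes, without blocking the CPU.
        let semaphore = inFlight
        commandBuffer.addCompletedHandler { _ in semaphore.signal() }
        commandBuffer.commit()

        slot = (slot + 1) % FrameRing.maxFramesInFlight
        frameNumber &+= 1
    }
}

// Usage: each frame clears the private buffer on the GPU; nothing is read back.
if let device = MTLCreateSystemDefaultDevice(),
   let ring = FrameRing(device: device, intermediateLength: 1 << 20) {
    for frame in 0..<10 {
        ring.encodeFrame(time: Float(frame) / 60) { commandBuffer, _, intermediate in
            guard let blit = commandBuffer.makeBlitCommandEncoder() else { return }
            blit.fill(buffer: intermediate, range: 0..<intermediate.length, value: 0)
            blit.endEncoding()
        }
    }
}
```

The CPU never calls `waitUntilCompleted()` here. It blocks only when it gets three frames ahead, which is the overlap the synchronization guidance is after. A real renderer would encode its compute or render passes inside the closure and bind `uniforms` and `intermediate` there.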