Fuse ops = memory wins

A key optimization tip for Apple Silicon GPUs: fuse operations so intermediate results are read from and written to memory fewer times, which can materially speed up local LLM inference. The tip underscores that memory bandwidth and dispatch efficiency are often the dominant levers for on‑device performance. (x.com)
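As a rough illustration of why fusion cuts memory traffic, the sketch below counts bytes moved for a hypothetical chain of elementwise ops. The function names, the op list, and the tensor size are illustrative assumptions, not taken from any framework, and the model ignores caches and real kernel scheduling.

```python
# Back-of-the-envelope memory-traffic model (an assumption-laden sketch,
# not a profiler). Each op is (name, extra_inputs), where extra_inputs is
# the number of additional same-shape tensors the op reads.

def unfused_bytes(ops, n_elements, bytes_per_element=2):
    """Each op is its own kernel: it reads its inputs and writes its output."""
    total = 0
    for _name, extra_inputs in ops:
        reads = (1 + extra_inputs) * n_elements
        writes = n_elements
        total += (reads + writes) * bytes_per_element
    return total


def fused_bytes(ops, n_elements, bytes_per_element=2):
    """One kernel: intermediates stay in registers, so only the original
    input, the extra inputs, and the final output touch memory."""
    extra = sum(extra_inputs for _name, extra_inputs in ops)
    reads = (1 + extra) * n_elements
    writes = n_elements
    return (reads + writes) * bytes_per_element


if __name__ == "__main__":
    # Hypothetical chain: scale, add a residual tensor, apply an activation.
    chain = [("scale", 0), ("add_residual", 1), ("activation", 0)]
    n = 1_000_000  # elements, at 2 bytes each (e.g. fp16)
    print("unfused bytes:", unfused_bytes(chain, n))
    print("fused bytes:  ", fused_bytes(chain, n))
```

Under these assumptions the unfused chain moves 14,000,000 bytes and the fused version 6,000,000, so the saving comes entirely from intermediates that never leave the kernel.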

FlashInfer’s authors measured speedups of up to 2–3× for fused Grouped‑Query and Fused‑RoPE attention kernels on A100/H100 hardware versus vLLM’s unfused implementations, showing that fusion can be a multiplicative throughput win for attention‑heavy workloads. (flashinfer.ai) Apple’s Core ML on‑device Llama 3.1 8B example states the GPU path is “usually constrained by memory bandwidth” and reports about 33 tokens/sec on an M1 Max after applying on‑device model optimizations. (machinelearning.apple.com)

Fused kernels reduce memory traffic by merging multiple ops so that intermediate tensors are loaded and stored fewer times, a pattern documented in practical profiling writeups and kernel‑engineering guides. (tanaymehta.com) Frameworks and production engines implement this at the compiler/runtime layer: vLLM applies torch.compile/Inductor‑level fusion passes to separate optimizations from model code, enabling those kernel merges without changing model definitions. (docs.vllm.ai) vLLM’s open‑source runtime also contains modular fused MoE and attention kernels in its repository (e.g., fused_moe modular_kernel implementations), illustrating how teams ship fused kernels as pluggable runtime components. (github.com)

Apple’s profiling docs point to Instruments and Metal debugger counters that report GPU memory bandwidth in GB/s and recommend inspecting buffer/texture bandwidth usage to find passes worth fusing. (developer.apple.com) Community measurements show practical GPU working‑set limits (roughly 75% of RAM, or about 96 GB usable on a 128 GB system) that constrain on‑device model sizing. (stencel.io)
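The bandwidth and working‑set figures above lend themselves to quick arithmetic. The sketch below is a simplified estimate, not a benchmark: it assumes decoding streams all model weights from memory once per generated token, and the function names and the example inputs (400 GB/s, 4 GB of weights) are placeholders chosen for illustration. Only the 75% working‑set fraction comes from the community measurement cited above.

```python
# Simplified estimates for a bandwidth-bound decoder (illustrative only).

def max_decode_tokens_per_sec(bandwidth_gb_s, weight_gb, extra_gb_per_token=0.0):
    """Upper bound on tokens/sec if every token requires reading all weights
    (plus optional extra traffic such as KV-cache reads) from memory once."""
    return bandwidth_gb_s / (weight_gb + extra_gb_per_token)


def gpu_working_set_gb(total_ram_gb, fraction=0.75):
    """Approximate GPU-usable memory, using the ~75% figure as an assumption."""
    return total_ram_gb * fraction


if __name__ == "__main__":
    # Placeholder inputs, not measurements of any specific device or model.
    print("decode upper bound (tok/s):", max_decode_tokens_per_sec(400, 4.0))
    print("usable working set (GB):   ", gpu_working_set_gb(128))
```

Measured throughput will normally land below such a bound, since it ignores dispatch overhead and any traffic beyond the weights. The gap between the bound and a profiled number is one hint at how much fusion or other traffic reduction might still recover.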

Shared from Scout - Be the smartest in the room.