Cloudflare's Infire runs Llama 4 on 2 H200s

- Cloudflare said its Infire inference engine can now run Meta’s Llama 4 Scout on just two Nvidia H200 GPUs inside Workers AI.
- The telling detail is memory headroom: Cloudflare says the two-H200 setup still leaves more than 56 GiB for KV cache.
- That matters because long-context open models usually demand fatter clusters, and Cloudflare is trying to squeeze them into edge-friendly deployments.

Cloudflare is talking about inference infrastructure: the software layer that actually keeps big AI models running once developers start sending real traffic. That layer matters more than the model card does, because if serving is too wasteful, long-context models get expensive fast. The gap has been obvious for a while: open models are improving quickly, but serving them efficiently across a distributed network is still hard. What changed is that Cloudflare says its in-house engine, Infire, can now run Meta’s Llama 4 Scout on only two Nvidia H200 GPUs, with enough memory left over to support a very large KV cache. (blog.cloudflare.com)

### What is Infire?

Infire is Cloudflare’s own LLM inference engine, written in Rust, built because the company decided existing serving stacks were not efficient enough for a globally distributed network. The pitch is simple: use less GPU memory, less CPU overhead, and less wasted work, so the same hardware can serve more requests. In Cloud[…]LLM 0.10.0 on an unloaded H100 NVL box, and that the gains were larger under real production load. (blog.cloudflare.com)

### Why do two H200s matter?

Because that is a small footprint for a model in this class. Llama 4 Scout is the smaller of Meta’s two launch Llama 4 models, but “smaller” is a little misleading: it is still a 109B-parameter mixture-of-experts model with 17B active parameters and 16 experts. These models are designed […]eal pressure on memory and serving software. (blog.cloudflare.com)

### What is the KV cache, and why mention it?

The KV cache is the memory the model uses to remember prior tokens while it generates the next ones. For long prompts and long conversations, that cache becomes the real bottleneck. Cloudflare’s specific claim is the interesting part here: on two H200s, Infire can run Llama […]an 1.2 million tokens.
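To make the headroom figure concrete, here is a rough sketch of how a KV cache budget translates into a token count. The only number taken from the article is the 56 GiB budget. The layer count, KV head count, head dimension, and 2-byte values are hypothetical placeholders chosen for illustration; they are not figures from Cloudflare's post and not a confirmed description of Llama 4 Scout or of how Infire lays out its cache.

```python
GIB = 2**30


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Bytes one cached token occupies: one key and one value vector
    per KV head, for every layer that keeps a cache entry."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_cached_tokens(free_bytes: int, per_token: int) -> int:
    """How many whole tokens fit into the memory left for the cache."""
    return free_bytes // per_token


# Hypothetical model shape (assumption, not from the article):
# 12 cache-holding layers, 8 KV heads, head dimension 128, 16-bit values.
per_token = kv_bytes_per_token(layers=12, kv_heads=8, head_dim=128, bytes_per_value=2)

# The 56 GiB budget is the headroom figure Cloudflare cites.
budget = 56 * GIB
tokens = max_cached_tokens(budget, per_token)

print(f"{per_token / 1024:.0f} KiB per token, about {tokens:,} tokens in 56 GiB")
```

With these placeholder values each token costs 48 KiB, so the budget holds roughly 1.2 million tokens. More cache-holding layers or wider values would shrink that count, which is why leftover memory, and not just whether the weights fit, decides how much context a deployment can serve.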
In other words, the headline is not just “the model fits.” It is “the model fits with room left to be useful.” (blog.cloudflare.com)

### Why is this hard on a distributed network?

Inference has two very different phases: prefill and decode. Prefill chews through the input prompt and is mostly compute-bound. Decode generates the answer token by token and is mostly memory-bound. Those phases want different parts of the GPU, but they block each other if you handle them naiv[…]nt has to juggle both phases while also minimizing internal memory overhead, or else the hardware looks bigger on paper than it does in practice. (blog.cloudflare.com)

### Is this only about benchmarking?

No. Cloudflare has already exposed Llama 4 Scout on Workers AI, and the company is clearly trying to show that its platform can host frontier-ish open models without requiring giant centralized GPU pods for every workload. The same April push also highlighted larger-model hosting on Workers AI, including […]after more optimization work. That gives the two-H200 Llama 4 setup more weight: it sits inside a broader effort to make bigger open models practical on Cloudflare’s network. (blog.cloudflare.com)

### Does this mean “edge inference” in the strict sense?

Not exactly in the tiny-box sense people sometimes imagine. H200s are still datacenter GPUs, and two of them are still serious hardware. But compared with the huge clusters many people associate with long-context serving, this is a compact deployment story. The po[…]nk the serving unit enough that more locations in its network can plausibly host it. That is the edge angle. (blog.cloudflare.com)

### What is the real takeaway?

The interesting news is not just “Cloudflare runs Llama 4.” It already offered Llama 4 Scout on Workers AI. The new part is the efficiency claim: that Infire can fit the model onto two H200s and still preserve a huge context budget.
If that holds up in production, it means the competitive edge in open-model pl[…] the serving software that wastes the least hardware. (blog.cloudflare.com)
