HPE Alletra X10000 speeds LLM cache
- HPE, Kamiwaza, NVIDIA, and Signal65 are pushing the Alletra Storage MP X10000 as a shared KV-cache tier for LLM inference, using S3 over RDMA.
- The headline number is up to 21.5× faster time-to-first-token versus no KV-cache offload, and 5.6× versus memory-only offload in HPE-backed tests.
- That matters because shared cache turns object storage from a cold data layer into live inference infrastructure for multi-node, hybrid AI deployments.

Object storage is not supposed to be the fast part of an LLM stack. It is supposed to be the big, cheap, durable part. HPE is trying to flip that assumption. The pitch is that its Alletra Storage MP X10000 can act as a shared KV-cache layer for inference, not just a place to park files, and that doing this over S3 with RDMA can sharply cut the wait for a model’s first token. The claim comes from a March 2026 paper and follow-on material from HPE, Kamiwaza, NVIDIA, and Signal65.

### What is the thing HPE is actually selling?

The X10000 is HPE’s scale-out unstructured storage system, essentially object storage built for large AI and data-lake workloads, with S3-compatible access and a disaggregated architecture meant to scale capacity and performance separately. HPE has already been positioning it for AI data pipelines, but the new angle is more aggressive: use it in the hot path of inference, not just upstream of it. (hpe.com)

### What is KV cache, and why does it matter?

KV cache is the model’s memory of the tokens it has already processed: the attention keys and values computed for each one. If a prompt shares a long prefix with work the model has seen before, reusing that cache means the GPUs do not have to recompute the whole prefill step, and prefill is where time-to-first-token gets burned. In clustered serving, the problem is that the cache usually lives in one server’s memory, so another server cannot easily reuse it after a restart, a rebalance, or a handoff. (hpe.com)

### Why bring object storage into that path?

Because a shared cache pool changes the economics. Kamiwaza’s description is direct: multiple vLLM servers can draw from the same cache, prefixes survive server restarts, and cache created by one node becomes available to others. That means less redundant prefill across the cluster, not just inside one box. In effect, object storage becomes a common memory layer for inference. (signal65.com)

### Why does RDMA matter so much here?

The catch with object storage is latency.
Traditional S3 over TCP carries too much software overhead if you want cache behavior that feels anywhere near memory. RDMA helps by moving data with less CPU and network-stack overhead. HPE and related research have been circling this idea for a while: push KV movement closer to the network path so GPUs spend less time waiting and more time generating tokens. (kamiwaza.ai)

### What did the testing actually show?

The strongest published number is not a flat “20× faster than local storage.” It is more specific. HPE’s white paper says the X10000, used as a secondary KV cache, showed up to 21.5× lower time-to-first-token (TTFT) versus systems with no KV-cache offload, plus up to 19.4× higher output token generation. Against memory-only offload, the gain was smaller but still large: 5.6× lower TTFT and 5.9× higher token rate. Those are relative results and depend on model, prompt, and GPU setup. (cug.org)

### So is this replacing GPU memory?

No, and that is the important nuance. GPU memory is still the fastest tier. System memory is still closer than storage. What HPE is selling is a larger shared layer underneath them, where cache can persist and travel across nodes. Think of it less like replacing RAM and more like adding a fast, networked extension that many inference servers can see at once. That is why the comparison against “no offload” looks dramatic, while the comparison against memory-only offload is the more honest measure of the storage tier itself. (hpe.com)

### Where does this matter most?

Multi-tenant inference, agentic workloads, and hybrid environments are the obvious targets. Those are the places where prompts repeat, servers churn, and cache locality breaks down. If you can keep prefixes warm across a fleet, you waste fewer GPU cycles on recomputation. That is useful for enterprise model serving and analytics stacks where the same context gets hit over and over. (hpe.com)

### Bottom line?

The interesting part is not just that HPE posted a big benchmark.
It is that storage vendors are trying to move object systems into the live inference loop. If that works beyond vendor-led testing, the design tradeoff changes: shared storage stops being the slow back room and starts looking like part of the model’s working memory. (hpe.com) (signal65.com)
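
### What does shared prefix reuse look like in code?

The cross-node reuse described earlier, where cache created by one server becomes available to others, can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not code from HPE, Kamiwaza, or vLLM: `SharedKVStore`, `InferenceNode`, `block_keys`, and the 4-token block size are all hypothetical names and values, an in-memory dictionary stands in for the shared object bucket, and nothing here touches S3 or RDMA. It only shows why keying cache blocks by a hash of the whole prefix lets a second node skip prefill work that a first node already did.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

BLOCK_SIZE = 4  # tokens per cache block (illustrative value, not a real engine default)


def block_keys(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one key per full block; each key hashes the entire prefix up to that block."""
    keys: List[str] = []
    hasher = hashlib.sha256()
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for start in range(0, full, block_size):
        for tok in tokens[start:start + block_size]:
            hasher.update(tok.to_bytes(4, "little"))
        keys.append(hasher.copy().hexdigest())
    return keys


class SharedKVStore:
    """Hypothetical stand-in for a shared object bucket: key -> serialized KV block."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._objects.get(key)

    def put(self, key: str, value: bytes) -> None:
        self._objects[key] = value


class InferenceNode:
    """Toy inference server that consults the shared store before 'computing' prefill."""

    def __init__(self, name: str, store: SharedKVStore) -> None:
        self.name = name
        self.store = store

    def prefill(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (tokens_reused, tokens_computed) for this prompt."""
        keys = block_keys(tokens)
        reused_blocks = 0
        for key in keys:
            if self.store.get(key) is None:
                break  # first miss ends the reusable prefix
            reused_blocks += 1
        # "Compute" the remaining full blocks and publish them for other nodes.
        for key in keys[reused_blocks:]:
            self.store.put(key, b"kv-from-" + self.name.encode())
        reused = reused_blocks * BLOCK_SIZE
        return reused, len(tokens) - reused


store = SharedKVStore()
node_a = InferenceNode("a", store)
node_b = InferenceNode("b", store)

system_prompt = list(range(100, 112))  # 12 tokens shared by both requests
first = node_a.prefill(system_prompt + [1, 2, 3, 4, 5])  # cold store: everything computed
second = node_b.prefill(system_prompt + [9, 8, 7, 6])    # different node, same prefix
print(first, second)
```

In this toy run the second node reuses the 12 shared tokens and computes only the 4 new ones, even though it never processed the prefix itself. A real deployment would store actual key and value tensors, handle eviction and consistency, and pay a network cost on every lookup, which is the part the RDMA path is meant to shrink.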