NAND could be the new HBM
NVIDIA’s new Inference Context Memory Storage (ICMS) platform is expected to push more key-value (KV) cache from expensive HBM into NAND storage, a structural shift that would raise demand for flash memory and directly benefit suppliers like Micron, SanDisk, and SK hynix. In short, moving KV cache to cheaper NAND changes the memory-cost math for large models and creates a tailwind for NAND makers. (x.com)
A large language model keeps a running notepad of every token it has already seen, so it does not have to reread the whole conversation for every new word. That notepad is called the key-value cache, and recent papers describe it as a first-order memory bottleneck because its size grows linearly with context length. (arxiv.org)

That cache usually lives in high-bandwidth memory, which is the ultra-fast memory stacked next to the graphics processor. NVIDIA’s own inference material says key-value cache offload exists because graphics processor memory runs out before many real workloads do. (developer.nvidia.com)

High-bandwidth memory is fast for the same reason a countertop is fast: the food is already within arm’s reach. The problem is that high-bandwidth memory is also the most expensive real estate in an artificial intelligence server, so every extra token of context burns premium memory that could have held more users or a bigger model. (developer.nvidia.com)

NVIDIA’s new answer is to add a middle shelf made of flash storage instead of trying to keep everything on the countertop. In March 2026, NVIDIA introduced Context Memory Storage, powered by BlueField-4, as a flash-based “G3.5” memory tier built specifically for key-value cache. (developer.nvidia.com)

Flash storage here means the same NAND chips used inside solid-state drives. NVIDIA says this tier is Ethernet-attached, shared across a pod, and sized for petabytes of context capacity rather than the far smaller pool available in on-package high-bandwidth memory. (nvidia.com)

The trick is not to replace high-bandwidth memory, but to move the colder part of the notepad out of it. NVIDIA describes the flash tier as close enough to pre-stage context back into graphics processor and host memory without stalling token generation, which means the hot lines stay nearby and the older lines move to cheaper storage. (developer.nvidia.com)

NVIDIA is also shrinking the notepad before it moves it.
Its NVFP4 key-value cache format, introduced for Blackwell graphics processors, cuts key-value cache memory footprint by up to 50% with less than 1% accuracy loss on the benchmarks NVIDIA published. (developer.nvidia.com)

Put those two changes together and the memory math shifts fast. A provider can keep the fastest memory focused on active decoding while using cheaper NAND flash for the overflow, shared history, and multi-turn agent context that would otherwise crowd out throughput. (developer.nvidia.com)

That is why NAND suppliers are suddenly part of an artificial intelligence story that used to belong mostly to high-bandwidth memory vendors. NVIDIA’s own product page says Context Memory Storage is built with an ecosystem that includes storage platform partners, and industry coverage in January 2026 tied the design directly to large pools of NVMe solid-state drive capacity. (nvidia.com) (blocksandfiles.com)

Micron, SanDisk, and SK hynix all sell the NAND flash that goes into enterprise solid-state drives, so a world with pod-scale flash for inference context means more demand for the chips they already make. The bet is not that NAND replaces high-bandwidth memory, but that every big inference cluster starts needing a second memory economy built on flash. (micron.com) (sandisk.com) (skhynix.com)

This only works if flash is fast enough for key-value cache traffic, which is why NVIDIA paired the storage tier with BlueField-4 data processors and Spectrum-X networking rather than ordinary shared storage. NVIDIA’s January 5, 2026 announcement said BlueField-4 handles key-value cache placement in hardware to cut metadata overhead and data movement. (investor.nvidia.com)

For years, the limiting part of artificial intelligence hardware was getting enough compute and enough high-bandwidth memory into one box.
NVIDIA is now designing for a different limit: how to keep huge, shared conversation history alive across an entire pod without paying high-bandwidth-memory prices for every token. (developer.nvidia.com)
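
The memory math described above can be made concrete with a rough back-of-the-envelope sketch. The sizing formula (two vectors per token, per layer, per key-value head) is the standard way to estimate key-value cache size. Everything else is an assumption for illustration: the model dimensions in `ModelShape`, the 8-bit baseline, the 25% hot fraction, and the helper names are hypothetical and do not come from NVIDIA. The halving step simply applies the "up to 50%" figure quoted earlier as a best case.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer dimensions, chosen only for illustration."""
    layers: int = 80
    kv_heads: int = 8
    head_dim: int = 128


def kv_cache_bytes(shape: ModelShape, context_tokens: int, bytes_per_value: float) -> float:
    """Estimate key-value cache size for one sequence.

    The factor 2 covers one key vector plus one value vector, stored
    per token, per layer, per key-value head.
    """
    values_per_token = 2 * shape.layers * shape.kv_heads * shape.head_dim
    return values_per_token * context_tokens * bytes_per_value


def split_tiers(total_bytes: float, hot_fraction: float) -> tuple[float, float]:
    """Split a cache into a hot part (kept in HBM) and a cold part (moved to flash)."""
    hot = total_bytes * hot_fraction
    return hot, total_bytes - hot


if __name__ == "__main__":
    shape = ModelShape()
    context = 128 * 1024  # a 128K-token conversation (assumed)

    baseline = kv_cache_bytes(shape, context, bytes_per_value=1)  # assumed 8-bit baseline
    compressed = baseline * 0.5  # best case of the "up to 50%" reduction
    hot, cold = split_tiers(compressed, hot_fraction=0.25)  # assumed hot share

    print(f"baseline cache:   {baseline / GIB:.1f} GiB per sequence")
    print(f"after halving:    {compressed / GIB:.1f} GiB per sequence")
    print(f"stays in HBM:     {hot / GIB:.1f} GiB")
    print(f"moves to flash:   {cold / GIB:.1f} GiB")
```

Under these assumed dimensions, a single 128K-token sequence needs about 20 GiB of cache at the 8-bit baseline, and doubling the context doubles that figure. Multiply by the number of concurrent users in a pod and the appeal of a cheaper overflow tier becomes easier to see.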