Agentic AI is breaking GPU memory assumptions
Frank Denneman flagged that KV cache accumulation from multi‑step agent reasoning and external tool calls forces a rethink of GPU memory planning for durable sessions. In short, workloads that look small at the start can balloon in state and pin memory across sessions, changing how you size and schedule GPUs.
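To make the sizing point concrete, here is a minimal back-of-the-envelope sketch of how KV cache grows with context length. It uses the standard per-token estimate (keys plus values, per layer, per KV head). The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache) is an assumption chosen to resemble an 8B-class model with grouped-query attention, and the names `ModelShape`, `kv_bytes`, and `session_growth` are hypothetical helpers, not part of any library.

```python
from dataclasses import dataclass
from typing import List

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    """Attention shape needed to size the KV cache (values are assumptions)."""
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_elem: int = 2  # fp16/bf16 cache entries


def kv_bytes_per_token(m: ModelShape) -> int:
    # Factor 2 covers one key vector and one value vector per layer and KV head.
    return 2 * m.layers * m.kv_heads * m.head_dim * m.bytes_per_elem


def kv_bytes(m: ModelShape, tokens: int) -> int:
    """KV cache size in bytes for a single session holding `tokens` tokens."""
    return kv_bytes_per_token(m) * tokens


def session_growth(m: ModelShape, start_tokens: int,
                   tokens_per_step: int, steps: int) -> List[int]:
    """KV bytes after each agent step, assuming every step appends a fixed
    number of tokens (reasoning plus tool output) and nothing is evicted."""
    return [kv_bytes(m, start_tokens + i * tokens_per_step)
            for i in range(steps + 1)]


# Assumed shape for illustration only; substitute your model's real config.
ASSUMED_8B = ModelShape(layers=32, kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for tokens in (2_000, 32_768, 65_536):
        size = kv_bytes(ASSUMED_8B, tokens) / GIB
        print(f"{tokens:>6} tokens -> {size:.2f} GiB per session")
```

Under these assumptions a session costs 128 KiB per token: about 0.24 GiB at 2,000 tokens, 4 GiB at 32,768 tokens, and 8 GiB at 65,536 tokens. Multiply by the number of concurrent durable sessions to see why a workload that starts small can end up pinning a large share of GPU memory.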
Frank Denneman published "Durable Agentic AI Sessions in GPU Memory" on March 12, 2026, laying out how multi‑step agent transcripts convert into persistent KV state across steps. (frankdenneman.nl)
Agentic workloads routinely build context windows in the 30K–64K token range, inflating KV cache sizes and turning prefill time into a dominant latency and memory cost driver. (hackernoon.com)
NVIDIA has moved to make KV cache offload part of the storage tier by standardizing NVMe‑resident inference context and promoting a BlueField‑4 powered Inference Context Memory Storage (ICMS) for pod‑level context memory on Rubin clusters. (blocksandfiles.com)
Academic and systems work is converging on predictive and tiered KV management: KVFlow (workflow‑aware eviction), Tetris (predictive offload and layerwise transfer), Continuum (TTL‑based pinning that improved job completion on Llama‑3.1 8B/70B in its eval), and SideQuest (model‑driven long‑horizon scheduling). (arxiv.org)
Commercial stacks are pushing cluster‑level caching: Crusoe’s MemoryAlloy claims up to 9.9× faster time‑to‑first‑token and >5× throughput via distributed KV caching, while vendors like Weka and Pynomial/NVIDIA Dynamo advertise augmented memory grids and tiered KV offload to expand effective cache capacity. (crusoe.ai)
Practical ops patterns are emerging: TTL pinning and selective KV eviction reduce cascading preemption but require scheduler changes, and systems that exploit NVLink/Grace unified memory enable CPU fallback for oversized caches at the cost of higher latency and different rack/network design. (arxiv.org)
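The TTL pinning pattern mentioned above can be sketched as a small admission and eviction policy. This is an illustrative toy, not the algorithm from Continuum or any other cited system: the class `TTLPinnedKVCache`, its methods, and the byte sizes are all hypothetical. The idea is that a session waiting on a tool call gets pinned for a TTL, eviction only considers unpinned sessions (least recently used first), and a new session that cannot fit without evicting pinned state is refused rather than triggering preemption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entry:
    size: int                   # KV bytes held by this session
    last_used: float            # timestamp of the last decode or prefill
    pinned_until: float = 0.0   # session is not evictable before this time


class TTLPinnedKVCache:
    """Toy KV cache accounting with TTL pinning and LRU eviction."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity = capacity_bytes
        self.used = 0
        self.entries: Dict[str, Entry] = {}

    def pin(self, session_id: str, now: float, ttl: float) -> None:
        # Called when a session pauses for a tool call: keep its KV resident
        # until the call is expected to return.
        entry = self.entries[session_id]
        entry.pinned_until = max(entry.pinned_until, now + ttl)

    def admit(self, session_id: str, size: int, now: float) -> Optional[List[str]]:
        """Admit a session. Returns the evicted session ids, or None if the
        session cannot fit without evicting pinned state."""
        if session_id in self.entries:
            raise ValueError(f"session {session_id!r} already resident")
        needed = self.used + size - self.capacity
        victims: List[str] = []
        if needed > 0:
            # Only sessions whose pin has expired are candidates, oldest first.
            candidates = sorted(
                (e.last_used, sid) for sid, e in self.entries.items()
                if e.pinned_until <= now
            )
            freed = 0
            for _, sid in candidates:
                if freed >= needed:
                    break
                victims.append(sid)
                freed += self.entries[sid].size
            if freed < needed:
                return None  # refuse admission; nothing is evicted
        for sid in victims:
            self.used -= self.entries.pop(sid).size
        self.entries[session_id] = Entry(size=size, last_used=now)
        self.used += size
        return victims


if __name__ == "__main__":
    cache = TTLPinnedKVCache(capacity_bytes=10)
    cache.admit("a", 4, now=0.0)
    cache.admit("b", 4, now=1.0)
    cache.pin("a", now=1.0, ttl=10.0)       # "a" is waiting on a tool call
    print(cache.admit("c", 4, now=2.0))     # evicts "b", not the older "a"
    print(cache.admit("d", 8, now=3.0))     # refused while "a" is pinned
    print(cache.admit("d", 8, now=12.0))    # pin expired: "a" and "c" evicted
```

In a real scheduler the refusal path would queue the request or route it to another GPU or a lower memory tier, which is where the scheduler changes noted above come in. Choosing the TTL is the hard part: too short and tool-calling sessions lose their state and pay prefill again, too long and idle sessions block admission.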