Agents strain datacenter model

- Nvidia’s recent AI-Q and Dynamo releases frame “agentic” AI as a different datacenter workload: long-running software that plans, calls tools, and reuses context across many model requests instead of one-shot chats.
- In Nvidia’s own agentic-inference data, coding tools like Claude Code and Codex can make hundreds of API calls per session, with 85% to 97% cache hits and an 11.7x cache read-write ratio.
- The bottleneck is shifting from raw chip speed to cache, routing, memory tiers, and network transfer in distributed inference systems. (developer.nvidia.com)

AI agents are pushing datacenters to optimize for memory, routing, and cache reuse, not just faster chips. (developer.nvidia.com) Nvidia’s recent AI-Q blueprint and Dynamo software describe agents as long-running systems that classify requests, choose tools, call other services, and sometimes escalate from a quick lookup to a deeper multi-step workflow. (docs.nvidia.com) (developer.nvidia.com)

That is different from a single chat prompt sent to one model and answered in one pass. In Nvidia’s AI-Q design, an intent classifier can route a query to a shallow researcher, a clarifier, or a deep researcher with planner and researcher subagents. (docs.nvidia.com)

The infrastructure problem starts with context. Large language models store prior tokens in a key-value cache, a working memory that grows with prompt length and has to stay available if the next step in the workflow will reuse it. (developer.nvidia.com)

Nvidia said agentic coding sessions can make hundreds of API calls while carrying the full conversation history forward. In one example, subsequent calls to the same worker hit 85% to 97% cache reuse, and a four-agent team reached a 97.2% aggregate hit rate. (developer.nvidia.com)

That produces what Nvidia calls a write-once, read-many pattern: the system computes a long prompt prefix once, then keeps reading it back across many steps. Nvidia said one measured workload showed an 11.7x read-write ratio in cache activity. (developer.nvidia.com)

Once that happens, the chokepoints move. Dynamo’s documentation says the system around the model now has to handle scheduling, key-value cache management across memory tiers, and low-latency data transfer between nodes. (docs.nvidia.com)

That is why Nvidia has been emphasizing “disaggregated serving,” which splits the prefill step that builds the cache from the decode step that generates tokens. The company says those phases can then be scaled independently, but only if routing and data transfer are fast enough. (docs.nvidia.com 1) (docs.nvidia.com 2)

The tradeoffs are concrete. Nvidia’s Dynamo docs say moving key-value cache without Remote Direct Memory Access can cause a 40x performance degradation in disaggregated deployments, and its autoscaling team says queries-per-second is a poor load metric when long inputs and long outputs stress different GPU pools. (docs.nvidia.com) (developer.nvidia.com)

Nvidia is also arguing that this favors more flexible deployment models. Its March 18, 2026 AI-Q enterprise-search post pitches on-premises control and private-data handling, while Dynamo is built to mix routing, offloading, and engine choices around the model. (developer.nvidia.com) (docs.nvidia.com)

The upshot is that agent scale now depends less on a benchmark for one forward pass and more on whether the whole stack can keep state warm, move it cheaply, and schedule the next step before users notice. (developer.nvidia.com) (docs.nvidia.com)
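
For readers who want to see why carrying the full history forward yields a write-once, read-many pattern, here is a minimal toy sketch in Python. It is not Nvidia's code and does not use any Dynamo API: the `PrefixCache` class, the `simulate_session` helper, and the session sizes are all hypothetical, chosen only to illustrate the idea. It assumes each call resends the whole history plus a fixed number of new tokens, and that the cache keeps the previous prompt.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PrefixCache:
    """Toy prefix cache (hypothetical; not an Nvidia or Dynamo interface)."""

    cached: list[int] = field(default_factory=list)  # tokens with stored KV entries
    tokens_read: int = 0      # prompt tokens served from the cache
    tokens_written: int = 0   # prompt tokens that had to be computed and stored

    def process(self, prompt: list[int]) -> float:
        """Account for one call and return its cache hit rate."""
        # Count how many leading tokens match what is already cached.
        shared = 0
        for old, new in zip(self.cached, prompt):
            if old != new:
                break
            shared += 1
        self.tokens_read += shared
        self.tokens_written += len(prompt) - shared
        self.cached = list(prompt)
        return shared / len(prompt) if prompt else 0.0


def simulate_session(calls: int, system_tokens: int, tokens_per_step: int) -> PrefixCache:
    """Replay a session where every call resends the full, growing history."""
    cache = PrefixCache()
    history = list(range(system_tokens))  # stand-in token IDs (assumption)
    for _ in range(calls):
        cache.process(history)
        start = len(history)
        history.extend(range(start, start + tokens_per_step))
    return cache


if __name__ == "__main__":
    cache = simulate_session(calls=20, system_tokens=1000, tokens_per_step=50)
    ratio = cache.tokens_read / cache.tokens_written
    print(f"read={cache.tokens_read} written={cache.tokens_written} ratio={ratio:.1f}x")
```

In this toy setup the first call writes the whole prefix and every later call writes only the newly appended tokens, so reads quickly outnumber writes. The exact ratio depends entirely on the assumed sizes and should not be read as a reproduction of Nvidia's measured figures.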
