DualPath Tackles Agentic Serving Bottlenecks

A new serving approach called DualPath targets the KV-cache storage bandwidth bottleneck that limits agentic LLM workloads. It relays KV-cache data between GPUs over RDMA, with quality-of-service controls, and reportedly raises offline throughput by up to 1.87x and online serving capacity by an average of 1.96x. The technique could be key for scaling complex, multi-turn agent interactions.

The core issue DualPath addresses is that in multi-turn, agentic workloads, the performance bottleneck shifts from compute to I/O. Loading the large key-value (KV) cache from external storage saturates the network interface cards (NICs) on the "prefill" engines responsible for processing initial prompts, while the NICs on the "decode" engines, which handle token generation, sit idle. This imbalance is a direct result of the architecture used in many disaggregated inference systems.

Production data from DeepSeek-AI, a collaborator on the DualPath research along with Peking University and Tsinghua University, shows that agentic applications can have KV-cache hit rates as high as 98.7%, which makes the speed of loading that cache critical. For long-context tasks, the memory required for the KV cache can even exceed the size of the model weights themselves.

DualPath's solution is to create a second data path for loading the KV cache. Instead of loading only from storage into the prefill engines, it also allows the cache to be loaded into the decode engines. The data is then transferred from the decode engines to the prefill engines using Remote Direct Memory Access (RDMA) over the high-speed compute network, effectively aggregating the bandwidth of all storage interfaces in the cluster.

This method leverages GPUDirect RDMA, a technology that enables a direct data path between GPU memory and other PCIe devices, such as a network card, bypassing the CPU and system memory. That significantly lowers latency and CPU overhead, which matters for the high-speed, inter-engine data transfers that DualPath relies on. A global scheduler dynamically balances the workload across the traditional loading path and the new one.

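
To make the bandwidth-aggregation idea concrete, here is a minimal sketch of a greedy scheduler that sends each KV-cache read to whichever storage-facing NIC would finish it soonest. It is not DualPath's actual scheduler: the Engine class, the assign and makespan functions, the 50 Gbps NIC speed, the 1 GB request size, and the 1 ms relay penalty for the decode path are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    name: str
    role: str              # "prefill" or "decode"
    nic_gbps: float        # hypothetical storage-facing NIC bandwidth
    queued_bytes: int = 0  # KV-cache bytes already assigned to this NIC

    def read_seconds(self, extra_bytes: int = 0) -> float:
        """Time for this NIC to read its queue plus extra_bytes from storage."""
        bytes_per_second = self.nic_gbps * 1e9 / 8
        return (self.queued_bytes + extra_bytes) / bytes_per_second


def assign(engines, kv_bytes, relay_overhead_s):
    """Greedy choice: pick the engine whose NIC would finish this read first.

    Reads routed through a decode engine pay an assumed fixed penalty for
    the extra RDMA hop from the decode engine to the prefill engine.
    """
    def finish_time(engine):
        t = engine.read_seconds(kv_bytes)
        if engine.role == "decode":
            t += relay_overhead_s
        return t

    chosen = min(engines, key=finish_time)
    chosen.queued_bytes += kv_bytes
    return chosen


def makespan(engines, n_requests, kv_bytes, relay_overhead_s=0.001):
    """Assign n_requests equal-sized reads; return the slowest NIC's read time."""
    for _ in range(n_requests):
        assign(engines, kv_bytes, relay_overhead_s)
    return max(e.read_seconds() for e in engines)


def make_cluster(use_decode_path):
    """Two prefill engines, optionally joined by two decode engines."""
    cluster = [Engine("prefill-0", "prefill", 50.0),
               Engine("prefill-1", "prefill", 50.0)]
    if use_decode_path:
        cluster += [Engine("decode-0", "decode", 50.0),
                    Engine("decode-1", "decode", 50.0)]
    return cluster


GB = 10**9
single_path = makespan(make_cluster(False), n_requests=8, kv_bytes=GB)
dual_path = makespan(make_cluster(True), n_requests=8, kv_bytes=GB)
print(f"prefill-only: {single_path:.2f} s, dual path: {dual_path:.2f} s")
```

In this toy setup, opening the decode-side NICs doubles the number of storage interfaces in use, so the slowest NIC's read time is cut in half. A real scheduler would also have to account for contention on the compute network, GPU memory pressure, and latency targets, none of which are modeled here.
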
The reported performance gains are substantial: tests on production agentic workloads show up to a 1.87x increase in offline throughput and an average 1.96x improvement in online serving capacity without violating service level objectives (SLOs). The system is also reported to scale linearly, managing up to 48,000 simultaneous agents while maintaining stable completion times. This addresses a key failure mode for many agentic systems, where latency and cost can climb steeply as complexity grows.

This approach of optimizing data movement at the system level is distinct from, but complementary to, inference optimization frameworks such as vLLM and TensorRT-LLM, which primarily focus on computational efficiency through techniques like PagedAttention, kernel fusion, and CUDA graphs. While those systems optimize what happens on the GPU, DualPath optimizes how data gets to the GPU in the first place, a critical distinction for scaling the next wave of multi-turn, long-context AI agents.
