'DualPath' Aims to Speed Up Agentic AI
A new technique called "DualPath" targets the KV-cache bandwidth bottleneck that slows down agentic LLM inference. By relaying cache data between GPU engines, the method reportedly boosts offline inference throughput by up to 1.87x, a significant gain for complex, multi-turn AI agents.
The KV-cache, which stores key and value states so that transformers do not recompute them, becomes a significant memory-bandwidth problem during long-running agentic workloads. As conversational context grows over hundreds of turns, the cache can exhaust GPU memory, creating a bottleneck that has more to do with data movement than raw compute power. For some models, a single 128K context prompt can consume 40GB of high-bandwidth memory (HBM) for the KV-cache alone.
Traditional inference systems assign all KV-cache read operations to the "prefill" engines, which initially process the prompt. In agentic scenarios with over 95% cache reuse, however, the prefill engines' storage interfaces become saturated while the "decode" engines, responsible for generating new tokens, have substantial unused I/O capacity. The resulting choke point suggests that the bottleneck is often an artifact of how I/O resources are allocated, not an inherent limitation.
DualPath addresses this by creating a second, parallel path for loading the KV-cache. The classic path remains, loading from storage directly into the prefill engines. The new path loads the cache into the decode engines and then transmits it to the prefill engines using RDMA (Remote Direct Memory Access) over the high-bandwidth compute network. This approach effectively aggregates the storage interface bandwidth of all engines in the system.
To manage this dual flow, the system employs a workload-aware scheduler and a traffic manager. The scheduler dynamically chooses the best path based on current load and manages compute quotas to prevent delays. The traffic manager uses virtual lanes to confine KV-cache transfers to lower-priority channels, preventing interference with latency-sensitive model operations such as all-to-all collectives. Together, these changes let the system draw on the combined storage bandwidth of every engine rather than being capped by the prefill engines' interfaces alone.
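To make the sizing and path-selection ideas concrete, here is a minimal Python sketch. It is not DualPath's actual scheduler: the model shape (80 layers, 8 KV heads, head dimension 128, 16-bit values), the bandwidth figures, and the names `kv_cache_bytes`, `LoadPath`, and `choose_path` are all illustrative assumptions. It first estimates KV-cache size with the standard per-token formula, then greedily sends each load down whichever path has the earliest estimated completion time, treating the RDMA relay as a sequential extra hop, which is a pessimistic simplification.

```python
from dataclasses import dataclass
from typing import List, Optional

GIB = 1024 ** 3


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV-cache size: one key and one value vector per token, layer, and KV head."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value


@dataclass
class LoadPath:
    """One route from storage to a prefill engine (hypothetical model)."""
    name: str
    storage_gib_per_s: float                 # read bandwidth of the engine doing the read
    relay_gib_per_s: Optional[float] = None  # extra RDMA hop; None means direct path
    queued_bytes: int = 0                    # bytes already waiting on this path

    def eta_seconds(self, size_bytes: int) -> float:
        """Estimated completion time if size_bytes joins this path's queue."""
        seconds = (self.queued_bytes + size_bytes) / GIB / self.storage_gib_per_s
        if self.relay_gib_per_s is not None:
            # Assumption: the relay hop is not overlapped with the storage read.
            seconds += size_bytes / GIB / self.relay_gib_per_s
        return seconds


def choose_path(paths: List[LoadPath], size_bytes: int) -> LoadPath:
    """Greedy choice: pick the path with the earliest estimated completion."""
    best = min(paths, key=lambda p: p.eta_seconds(size_bytes))
    best.queued_bytes += size_bytes
    return best


# Assumed model shape; with these values a 128K-token context needs 40 GiB.
size = kv_cache_bytes(tokens=128 * 1024, layers=80, kv_heads=8, head_dim=128)
print(size / GIB)  # 40.0

# Assumed bandwidths: 10 GiB/s storage reads on each side, 40 GiB/s RDMA relay.
paths = [
    LoadPath("direct", storage_gib_per_s=10.0),
    LoadPath("relay", storage_gib_per_s=10.0, relay_gib_per_s=40.0),
]
choices = [choose_path(paths, size).name for _ in range(3)]
print(choices)  # ['direct', 'relay', 'direct']
```

With an idle system the direct path wins because it avoids the extra hop; once its queue builds up, the relay path through the decode engine becomes the faster option, which is the load-balancing effect the article describes.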
The technique is implemented on top of modern inference stacks that already separate prompt processing (prefill) and token generation (decode) for efficiency. In production deployments and evaluations on realistic agentic workloads, DualPath reportedly increased end-to-end throughput by up to 1.87x for offline inference and improved online serving throughput by an average of 1.96x without violating service level objectives.