New Paper Proposes 'DualPath' to Double LLM Agent Throughput
A new research paper from DeepSeek details a method called DualPath that can nearly double throughput for multi-turn LLM agents. The technique addresses the I/O bottleneck of loading the KV cache by adding a second path over which the cache can be read from storage. The design aims to improve inference efficiency for agentic systems that must maintain conversational context across many interactions.
The key-value (KV) cache is a core optimization in Transformer inference: it stores the keys and values computed for previous tokens so that they are not recomputed as each new token is generated. For agentic AI that engages in multi-turn conversations, however, this cache can grow very large, and loading it from storage into GPU memory becomes a significant I/O bottleneck that leaves expensive compute units idle.
DeepSeek's DualPath tackles this I/O wall by creating a secondary, parallel path for loading the KV cache. Traditionally, only the "prefill" engine, which processes the initial prompt, pulls this data from storage. DualPath also lets the "decode" engine, which generates subsequent tokens, load the cache using its otherwise idle network bandwidth, effectively doubling the read capacity. Cache loaded by the decode engine is then routed to the prefill engine over a high-speed RDMA network, so that storage bandwidth is pooled globally and balanced dynamically across engines.
In tests on a 660B-parameter model, the method increased offline inference throughput by up to 1.87 times and average throughput for online services by 1.96 times.
The paper is a joint effort from researchers at DeepSeek, Peking University, and Tsinghua University, with Peking University doctoral student Wu Yongtong credited as first author. The collaboration reflects a trend of top Chinese universities and AI firms partnering on fundamental infrastructure challenges in deploying large-scale AI systems. Such system-level optimization matters more as models are increasingly used for complex, stateful tasks that go far beyond single-shot text generation.
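To make the bandwidth-pooling idea concrete, the following is a minimal back-of-the-envelope sketch, not the paper's scheduler. It assumes a KV cache can be split freely between the two paths, that each path is limited only by its own storage bandwidth, and that the RDMA hop from the decode engine to the prefill engine is free. The names `LoadPlan`, `single_path_seconds`, and `dual_path_plan`, and all the numbers, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class LoadPlan:
    prefill_bytes: float  # read from storage directly by the prefill engine
    decode_bytes: float   # read by the decode engine, then forwarded over RDMA
    seconds: float        # time until both paths have finished loading


def single_path_seconds(cache_bytes: float, prefill_bw: float) -> float:
    """Load time when only the prefill engine reads from storage."""
    if prefill_bw <= 0:
        raise ValueError("prefill_bw must be positive")
    return cache_bytes / prefill_bw


def dual_path_plan(cache_bytes: float, prefill_bw: float,
                   decode_idle_bw: float) -> LoadPlan:
    """Split one cache load across both paths in proportion to bandwidth.

    Assumption (not from the paper): the RDMA transfer from the decode
    engine to the prefill engine adds no time, so both paths finish together.
    """
    if prefill_bw <= 0 or decode_idle_bw < 0:
        raise ValueError("bandwidths must be positive (decode may be zero)")
    total_bw = prefill_bw + decode_idle_bw
    prefill_bytes = cache_bytes * (prefill_bw / total_bw)
    decode_bytes = cache_bytes - prefill_bytes
    return LoadPlan(prefill_bytes, decode_bytes, cache_bytes / total_bw)


if __name__ == "__main__":
    cache = 100e9  # hypothetical 100 GB of KV cache
    bw = 25e9      # hypothetical 25 GB/s of storage bandwidth per engine
    plan = dual_path_plan(cache, bw, bw)
    speedup = single_path_seconds(cache, bw) / plan.seconds
    print(f"dual-path load: {plan.seconds:.1f} s, speedup {speedup:.2f}x")
```

With equal bandwidth on both paths, this toy model yields an ideal 2x reduction in load time. That is a best case under the stated assumptions; the reported 1.87x and 1.96x throughput gains sit just below it, which is what one would expect once transfer and scheduling overheads are counted.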