ArXiv paper proposes stateful serving

- Victor Norgren posted a May 13 arXiv paper proposing stateful transformer serving that keeps KV caches alive across sessions and updates them continuously. (arxiv.org)
- The paper reports a roughly 43-millisecond standard-query path and says its reference implementation was 2.4x to 5.9x faster than four engines. (arxiv.org)
- The preprint is available as arXiv:2605.13784, and Norgren lists LayerScale, Inc. as the affiliation. (arxiv.org)

Victor Norgren, who lists LayerScale, Inc. as his affiliation, posted an arXiv paper on May 13 that argues large language model serving should shift from request-by-request execution to persistent, stateful sessions. (arxiv.org)

The paper, titled “Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers,” describes a system that keeps a session’s KV cache alive and updates it as new data arrives, instead of rebuilding attention state on every query. (arxiv.org) Norgren says that design separates a data plane, which ingests updates asynchronously, from a query plane, which reads precomputed state when a user asks a question. The paper is a preprint on arXiv, not a peer-reviewed conference publication.

The proposal lands in a part of the stack where current inference engines already try to avoid redundant work through prefix caching and KV reuse. vLLM documents “automatic prefix caching,” SGLang says its runtime is built around RadixAttention and prefix caching, and Nvidia’s TensorRT-LLM documents KV cache reuse for requests that begin with the same prompt. llama.cpp also exposes prompt-cache options for faster startup on repeated prompts.

### What is the paper changing in the serving model?

(arxiv.org) The May 13 paper says conventional transformer inference is request-driven and pays an O(n) prefill cost on every query. Norgren proposes “stateful sessions” in which the KV cache persists across a session and is advanced incrementally as new data arrives. That means the system processes the stream once, then lets later questions read from the accumulated state instead of replaying the whole context. The abstract says the resulting query complexity is O(|q|) in query length, independent of accumulated context size.
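
To make the cost argument concrete, here is a minimal sketch that only counts the tokens pushed through the model under each serving style. It is not from the paper: the function names, the 1,000-token growth step and the 20-token query length are illustrative assumptions, and the count ignores both the prefix caching that existing engines already do and the cost of attending over cached state.

```python
from typing import Sequence


def request_driven_tokens(context_sizes: Sequence[int], query_len: int) -> int:
    """Tokens processed when every query replays the full context, then the query.

    context_sizes[i] is the accumulated context length (in tokens) at the
    moment query i arrives. This models the no-reuse worst case.
    """
    return sum(n + query_len for n in context_sizes)


def stateful_tokens(context_sizes: Sequence[int], query_len: int) -> int:
    """Tokens processed when the stream is ingested once and kept as state.

    The context is paid for a single time (its final length), and each
    query only adds its own tokens on top of the persisted state.
    """
    ingested_once = context_sizes[-1] if context_sizes else 0
    return ingested_once + query_len * len(context_sizes)


# Hypothetical workload: the stream grows by 1,000 tokens between five queries.
sizes = [1000, 2000, 3000, 4000, 5000]
replayed = request_driven_tokens(sizes, query_len=20)
persisted = stateful_tokens(sizes, query_len=20)
```

With these assumed numbers the request-driven count is 15,100 tokens against 5,100 for the stateful one, and the gap widens as the context keeps accumulating. Token counts are only a proxy for latency, so this should not be read as reproducing the paper's measurements.
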
(docs.vllm.ai) Norgren also describes “Flash Queries,” which pre-compute answers to registered questions after each data update and can push latency from the roughly 43-millisecond standard-query path “toward zero” for predictable queries by using idle GPU cycles and server-sent events.

### How is that different from ordinary KV caching?

vLLM, SGLang and TensorRT-LLM all describe mechanisms that reuse KV state when requests share a prefix. (arxiv.org) Those systems still begin from a request boundary: a new prompt arrives, the engine looks for reusable prefix blocks or pages, and then computes the rest.

Norgren’s paper describes a different unit of work. Instead of treating each question as the event that triggers prefill, the system treats incoming data itself as the event that advances model state. (arxiv.org) In the paper’s framing, the context is persistent and the query becomes a lightweight read over already-updated attention state.

### What numbers does the paper report?

The paper says its reference implementation achieved a 2.4x to 5.9x speedup over vLLM, SGLang, TensorRT-LLM and llama.cpp on streaming benchmarks. It also says the same setup was 22x to 92x faster than cloud APIs including GPT-5.2, Claude Haiku and Claude Opus 4.5 in the reported tests. (docs.vllm.ai) A roughly 43-millisecond standard-query path is the headline latency figure in the abstract. The comparison in the paper is tied to a reference implementation and benchmark conditions described by the author, not an independently replicated industry benchmark. (arxiv.org)

### Which workloads is this aimed at?

The introduction uses financial analysis as the lead example. Norgren writes that trading systems continuously receive price updates and need to answer analytical questions such as current trend with minimal latency.
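
The trading example, combined with the data-plane and query-plane split and the Flash Queries idea, can be illustrated with a toy sketch. Nothing here is the paper's implementation or API: `StreamSession`, `register`, `ingest`, `read` and the `trend` rule are hypothetical names, a plain price window stands in for the KV cache, and eager recomputation stands in for the idle-GPU precompute and server-sent events the paper describes.

```python
from collections import deque
from typing import Callable, Deque, Dict

Answer = Callable[[Deque[float]], str]


class StreamSession:
    """Toy stateful session: state advances on data arrival, not on queries."""

    def __init__(self, window: int = 3) -> None:
        # Hypothetical bounded state standing in for a persistent KV cache.
        self.prices: Deque[float] = deque(maxlen=window)
        self._registered: Dict[str, Answer] = {}
        self._answers: Dict[str, str] = {}

    def register(self, name: str, fn: Answer) -> None:
        """Register a predictable question and compute its first answer."""
        self._registered[name] = fn
        self._answers[name] = fn(self.prices)

    def ingest(self, price: float) -> None:
        """Data plane: advance state, then refresh every registered answer."""
        self.prices.append(price)
        for name, fn in self._registered.items():
            self._answers[name] = fn(self.prices)

    def read(self, name: str) -> str:
        """Query plane: return the precomputed answer without replaying data."""
        return self._answers[name]


def trend(prices: Deque[float]) -> str:
    """Illustrative rule: compare the oldest and newest price in the window."""
    if len(prices) < 2:
        return "unknown"
    if prices[-1] > prices[0]:
        return "up"
    if prices[-1] < prices[0]:
        return "down"
    return "flat"


session = StreamSession(window=3)
session.register("trend", trend)
for tick in [100.0, 101.5, 99.0, 98.5]:
    session.ingest(tick)
print(session.read("trend"))  # prints "down" (window is 101.5, 99.0, 98.5)
```

The point of the sketch is the ordering of work: every price update pays the recomputation cost, so the later `read` is a dictionary lookup regardless of how many ticks have arrived.
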
(arxiv.org) The same design would also fit other continuously updated inputs such as logs, telemetry or sensor streams, because the claimed advantage comes from ingesting new data incrementally rather than replaying a long prompt each time. That workload fit is an inference from the paper’s session model and examples, not a separate benchmark category reported by the author. (arxiv.org)

### What should readers watch next?

ArXiv lists the submission as arXiv:2605.13784, filed on May 13, 2026, with Victor Norgren as the sole named author. The next concrete marker will be whether the paper receives revisions, code release details, outside reproductions, or a conference submission tied to the reference implementation. (arxiv.org 1) (arxiv.org 2)
