Multi-Stream LLMs parallel inputs paper
- Guinan Su, Yanwu Yang, Xueyan Li and Jonas Geiping posted “Multi-Stream LLMs” to arXiv on May 12, proposing parallel token streams for agents. (arxiv.org)
- The paper says each forward pass can read multiple input streams and generate multiple output streams at once, instead of one chat sequence. (arxiv.org)
- The authors linked code on GitHub, with section-specific training and evaluation scripts for efficiency, security and monitorability experiments. (github.com)

Guinan Su, Yanwu Yang, Xueyan Li and Jonas Geiping have posted a paper that argues large language models should stop treating every interaction as one long chat transcript. In “Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs,” the authors describe instruction-tuning models to read from and write to multiple token streams in the same forward pass. (arxiv.org)

The paper was submitted to arXiv on May 12, according to the arXiv record. A linked GitHub repository lays out code for the paper’s efficiency, security and monitorability sections. (github.com)

The core claim is architectural, not a new wrapper around an existing chat loop. The paper says current agent systems still inherit a single-stream format in which user messages, tool calls, internal reasoning and model outputs are serialized into one sequence. The authors write that this creates a bottleneck because a model “cannot act while reading” and cannot react to new information while it is still generating output.

### What are the authors actually changing inside the model interface?

The paper says the model is trained on “multiple, parallel streams of computation,” with separate channels for different roles rather than one merged transcript. (arxiv.org) In the authors’ description, each forward pass simultaneously consumes several input streams and emits tokens on several output streams, while later steps remain causally dependent on earlier ones.

That means user input, model output and intermediate reasoning do not all have to wait in line behind one another. The authors frame that as a way to let an agent read, think and act in overlapping fashion, instead of alternating those stages one after another inside a single text sequence. (arxiv.org)

### How does the code release map to the paper’s claims?

The GitHub repository is organized into three experiment blocks.
The README lists Section 5 on efficiency, Section 6 on security and Section 7 on monitorability, with each subfolder described as self-contained. (arxiv.org)

For efficiency, the repository says it trains Qwen3-1.7B and Qwen3-4B models as parallel-stream systems with two streams for “solving-while-reading” and three streams for “auditing-while-solving.” The listed evaluations include GSM8K, MATH500, LogicNLI, SQuAD, ProofWriter and PubMedQA. (arxiv.org)

For security, the README says the authors train Qwen2.5-7B and Qwen3-4B on “multi-stream-reconstructed Alpaca” and evaluate on TensorTrust, Gandalf, Purple, RuLES, StruQ-ID/OOD, NESSiE and IFEval. For monitorability, it lists Stream-8B and Stream-27B models with 10 cognitive streams and evaluations tied to awareness and monitoring tasks. (github.com)

### Why are two or three streams the practical detail to watch first?

The repository’s efficiency section gives the clearest near-term picture because it ties the idea to concrete training setups. The code notes describe interleaving streams into a single token sequence, while adding per-stream positional counters, stream embeddings, a stream-causal attention mask and one shared language-model head. (github.com)

That matters because the proposal is not presented as a separate ensemble of agents voting after the fact. The implementation notes say the model functions with “complete weight sharing between streams” at inference time, even though some class names retain older “Medusa” labels for historical reasons. (github.com)

### Is this a new model family or a training recipe for existing backbones?

The code release points to the second reading. The repository repeatedly names Qwen2.5, Qwen3 and Qwen3.5 backbones, and the paper describes the change as a shift in instruction-tuning and data construction rather than a wholly separate foundation model design.
(github.com) The efficiency README also points to a “wait-k” data-construction pipeline and a canonical multi-GPU launcher, suggesting the authors want other researchers to reproduce the training setup rather than only inspect results. (github.com) The top-level README directs users to section-specific READMEs for setup and running the experiments.

### What can people check next?

The arXiv entry links the preprint and code for “Multi-Stream LLMs,” and the GitHub repository provides separate folders for Sections 5, 6 and 7. Researchers who want to test the claim can start with the efficiency scripts for Qwen3-1.7B and Qwen3-4B, then move to the security and monitorability setups described in the repository documentation. (github.com) (arxiv.org)
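
For readers who want a feel for the mechanics before opening the repository, the interleaving idea from the efficiency notes (one flattened token sequence, per-stream positional counters and a stream-causal attention mask) can be illustrated with a toy sketch. This is not the authors' code. The function names `interleave` and `stream_causal_mask` are hypothetical, the round-robin ordering is an assumption, and so is the masking rule used here: a token may attend to every token from a strictly earlier step in any stream, and at its own step only to itself. The repository may define the mask differently.

```python
from typing import List, Tuple

# (stream_id, step, token): `step` plays the role of a per-stream
# positional counter, counting tokens inside one stream rather than
# positions in the flattened sequence.
Item = Tuple[int, int, str]


def interleave(streams: List[List[str]]) -> List[Item]:
    """Flatten parallel streams into one sequence, round-robin by step.

    Assumption: one token per stream per step; shorter streams simply
    stop contributing once they run out of tokens.
    """
    longest = max((len(s) for s in streams), default=0)
    seq: List[Item] = []
    for step in range(longest):
        for stream_id, tokens in enumerate(streams):
            if step < len(tokens):
                seq.append((stream_id, step, tokens[step]))
    return seq


def stream_causal_mask(seq: List[Item]) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j.

    Assumed rule (not taken from the paper): attend to any token from a
    strictly earlier step, regardless of stream, and to the token itself.
    Tokens emitted at the same step in other streams stay hidden, since
    they would be produced in the same forward pass.
    """
    mask: List[List[bool]] = []
    for stream_i, step_i, _ in seq:
        row = [
            step_j < step_i or (step_j == step_i and stream_j == stream_i)
            for stream_j, step_j, _ in seq
        ]
        mask.append(row)
    return mask


# Toy demo: stream 0 stands in for an input being read, stream 1 for an
# output being written at the same time.
demo = interleave([["What", "is", "2+2", "?"], ["The", "answer", "is"]])
for (stream_id, step, token), row in zip(demo, stream_causal_mask(demo)):
    visible = "".join("x" if allowed else "." for allowed in row)
    print(f"stream={stream_id} step={step} {token:<7} {visible}")
```

Under these assumptions the two streams share one sequence and one set of weights, which is the property the implementation notes emphasize, while the mask keeps same-step tokens from peeking at each other. A real implementation would also add stream embeddings and feed the per-stream counters to the positional encoding, which this sketch leaves out.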