New ML papers cluster

- Several fresh research papers surfaced this week on prefill caching, looped-LLM scaling, and latent reasoning.
- Notable titles include 'Prefill-as-a-Service', 'Parcae' scaling laws, and 'Scaling Latent Reasoning'.
- The stream of papers suggests rapid iteration on inference infrastructure and reasoning techniques across research groups.

A cluster of machine-learning papers published in April is converging on the same idea: spend compute more carefully, whether on serving models or on making them reason longer. (arxiv.org 1) (arxiv.org 2)

Large language models do two expensive jobs for every prompt. They first read the prompt and build an internal memory called a key-value cache, then they generate tokens one by one using that cache. (arxiv.org)

One April 16 paper, “Prefill-as-a-Service,” argues those two jobs no longer need to stay in the same data center for newer hybrid-attention models with smaller caches. The authors, from Tsinghua University and Moonshot AI, report 54% higher serving throughput than a homogeneous prefill-decode setup and 32% above a naive heterogeneous baseline in a case study on an internal 1 trillion-parameter model. (arxiv.org)

A second paper, “Parcae,” posted April 14 by researchers at University of California San Diego and Together AI, focuses on model design rather than serving. It studies looped language models, which reuse the same block of layers multiple times instead of adding entirely new layers and parameters. (arxiv.org)

That reuse is meant to buy extra depth without the full memory cost of a larger model, but the paper says earlier training recipes were unstable and could produce loss spikes. Parcae reports up to 6.3% lower validation perplexity than prior large-scale looped models and says a 1.3 billion-parameter version reached up to 87.5% of the quality of a Transformer twice its size under a fixed parameter and data budget. (arxiv.org)

A related line of work asks whether a model can “think” in hidden states instead of writing out every intermediate step as text. In “Scaling Latent Reasoning via Looped Language Models,” first posted on October 29, 2025 and revised on November 17, 2025, the authors present Ouro, an open-source family of looped models trained with iterative computation in latent space and scaled to 7.7 trillion tokens. (arxiv.org)

That paper says Ouro models with 1.4 billion and 2.6 billion parameters matched results of state-of-the-art models up to 12 billion parameters on a range of benchmarks. The authors include Yoshua Bengio and Jason Eshraghian, and they argue the gain comes from stronger “knowledge manipulation” rather than simply storing more facts. (arxiv.org)

The looped-model push did not start this month. An ICLR 2025 poster paper, “Reasoning with Latent Thoughts: On the Power of Looped Transformers,” argued that many reasoning problems need more depth rather than more parameters, and that a k-layer model looped L times could nearly match a kL-layer non-looped model on synthetic tasks such as addition and p-hop induction. (openreview.net) (arxiv.org)

Taken together, the April papers point in two directions at once: cheaper inference systems on the outside and more iterative computation inside the model. The common bet is that the next gains may come less from adding raw parameter count and more from deciding where, when, and how often the same computation runs. (arxiv.org 1) (arxiv.org 2) (arxiv.org 3)
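
For readers who want to see the looped-model idea in concrete terms, here is a toy Python sketch. It is not code from any of the papers. `ToyLayer`, `make_layers`, `run_looped`, `run_stacked`, and `count_params` are hypothetical names invented for illustration, and each "layer" is a two-parameter scalar function standing in for a real Transformer layer. The sketch only shows the bookkeeping: a block of k layers looped L times runs kL layers of computation while storing k layers of weights.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class ToyLayer:
    """Hypothetical stand-in for one Transformer layer: y = tanh(w * x + b)."""
    w: float
    b: float

    def __call__(self, x: float) -> float:
        return math.tanh(self.w * x + self.b)

    @property
    def num_params(self) -> int:
        return 2  # one weight and one bias


def make_layers(n: int, seed: int) -> list:
    """Build n toy layers with reproducible random parameters."""
    rng = random.Random(seed)
    return [ToyLayer(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]


def run_stacked(layers: list, x: float) -> float:
    """Ordinary deep model: every layer is applied exactly once."""
    for layer in layers:
        x = layer(x)
    return x


def run_looped(block: list, loops: int, x: float) -> float:
    """Looped model: the same block of layers is applied `loops` times."""
    for _ in range(loops):
        x = run_stacked(block, x)
    return x


def count_params(layers: list) -> int:
    """Count stored parameters; a layer object reused in a loop counts once."""
    return sum(layer.num_params for layer in layers)


k, L = 4, 3                            # k layers in the block, looped L times
looped_block = make_layers(k, seed=0)  # stores k layers of weights
stacked = make_layers(k * L, seed=1)   # stores k * L layers of weights

print("looped  params:", count_params(looped_block), "layers applied:", k * L)
print("stacked params:", count_params(stacked), "layers applied:", k * L)
print("looped output:", run_looped(looped_block, L, 0.5))
```

In this toy setup the looped model is the same computation as a kL-layer stack whose weights are tied across repetitions, which is why it stores a third of the parameters for the same depth. Whether that trade preserves quality in real models is the empirical question the papers above study.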

