AutoTTS agentic test-time scaling paper

- Researchers posted the AutoTTS paper on arXiv on May 8, describing an agentic framework that searches for inference-time control strategies instead of hand-crafting them.
- The paper says one full discovery run cost $39.9, took 160 minutes, and found a controller that cut tokens by 69.5% versus SC@64.
- The paper, code repository and project page are public now; X discussion also bundled separate NeuOS v4 and “Soul Vectors” claims.

Researchers posted a new paper this month arguing that large language models can be used to design better inference-time reasoning policies for other models, rather than relying on manually tuned prompting heuristics. The paper, “LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling,” appeared on arXiv on May 8 and was revised on May 12. Its authors (Tong Zheng and 12 co-authors spanning UMD, UVA, WUSTL, UNC, Google and Meta, according to the project materials) call the framework AutoTTS. They describe it as a way to search automatically for how a model should spend extra compute during inference.

### What is AutoTTS actually trying to automate?

Test-time scaling refers to spending more compute after a prompt arrives, for example by generating more reasoning paths, probing intermediate states, or deciding whether to stop or keep going. The AutoTTS paper says most of those strategies are still hand-built by researchers, who choose branching and stopping rules by intuition. AutoTTS shifts that work into a search process: humans define an environment, and a coding agent proposes and revises controllers inside it. (arxiv.org)

The GitHub repository says those controllers are code-defined policies operating over an offline replay environment, not gradient updates to the base model. In the setup described by the authors, the controller can branch, continue, probe, prune or answer, using pre-collected reasoning trajectories and probe signals rather than repeated live model calls during evaluation. (arxiv.org)

### How does the paper say the system works at inference time?

The arXiv abstract says the key design choice is the environment itself: it has to make the search space manageable and provide cheap, frequent feedback. The paper’s concrete instantiation treats width-depth allocation as a controller-synthesis problem, where the controller decides when to open new branches, extend existing ones, inspect them, abandon them or stop entirely.
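To make the idea of a code-defined controller running over cached trajectories more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the `Segment` schema, the `ReplayEnv` class, the `run_controller` policy, its thresholds and the toy traces are all hypothetical, and the numbers it produces are invented rather than drawn from any reported result. The sketch opens a few branches, extends them in lockstep, probes each one, abandons low-confidence branches and answers early once enough confident branches agree, while counting only the tokens it actually replays.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One cached chunk of a reasoning trajectory (hypothetical schema)."""
    tokens: int        # tokens spent generating this chunk
    answer: str        # answer a probe would read off after this chunk
    confidence: float  # probe confidence in that answer, in [0, 1]


class ReplayEnv:
    """Replays pre-collected trajectories; never calls a live model."""

    def __init__(self, trajectories):
        self.trajectories = trajectories
        self.depth = {}       # branch id -> number of segments consumed
        self.tokens_used = 0

    def branch(self):
        """Open the next unused cached trajectory; return its id or None."""
        branch_id = len(self.depth)
        if branch_id >= len(self.trajectories):
            return None
        self.depth[branch_id] = 0
        return branch_id

    def extend(self, branch_id):
        """Consume one more cached segment; return False if none are left."""
        trajectory = self.trajectories[branch_id]
        if self.depth[branch_id] >= len(trajectory):
            return False
        self.tokens_used += trajectory[self.depth[branch_id]].tokens
        self.depth[branch_id] += 1
        return True

    def probe(self, branch_id):
        """Read (answer, confidence) at the branch's current depth.

        Only meaningful after at least one successful extend().
        """
        segment = self.trajectories[branch_id][self.depth[branch_id] - 1]
        return segment.answer, segment.confidence


def run_controller(env, width=3, agree=2, stop_conf=0.8, prune_conf=0.2):
    """Toy policy (assumed thresholds): branch, extend, probe, prune, answer."""
    alive = []
    for _ in range(width):
        branch_id = env.branch()
        if branch_id is not None:
            alive.append(branch_id)
    votes = {}  # branch id -> (answer, confidence) from its latest probe
    while alive:
        survivors = []
        for branch_id in alive:
            if not env.extend(branch_id):
                continue                      # exhausted: keep its last vote
            answer, conf = env.probe(branch_id)
            if conf < prune_conf:
                votes.pop(branch_id, None)    # abandon branch, drop its vote
                continue
            votes[branch_id] = (answer, conf)
            survivors.append(branch_id)
        alive = survivors
        confident = Counter(a for a, c in votes.values() if c >= stop_conf)
        if confident:
            top, count = confident.most_common(1)[0]
            if count >= agree:
                return top                    # stop early and answer
    fallback = Counter(a for a, _ in votes.values())
    return fallback.most_common(1)[0][0] if fallback else None


def seg(answer, *confidences, tokens=100):
    """Build a toy trajectory whose probe answer stays fixed."""
    return [Segment(tokens, answer, c) for c in confidences]


# Invented toy traces: two branches converge on "42", one is weak.
toy_traces = [
    seg("42", 0.5, 0.85, 0.9, 0.95),
    seg("42", 0.6, 0.9, 0.9, 0.9),
    seg("7", 0.1, 0.3, 0.4, 0.5),
]
env = ReplayEnv(toy_traces)
answer = run_controller(env)
# Baseline for comparison: run every cached trajectory to the end.
full_cost = sum(s.tokens for trace in toy_traces for s in trace)
savings = 1 - env.tokens_used / full_cost
print(f"answer={answer} tokens={env.tokens_used}/{full_cost} savings={savings:.1%}")
```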
(arxiv.org) The repository says the search is parameterized by a scalar beta value that schedules internal hyperparameters, and candidate controllers are replay-evaluated on cached traces. The authors say they also added execution-trace feedback so the coding agent can diagnose why a candidate program failed and revise it in later rounds.

### What evidence do the authors provide?

The paper reports results on mathematical reasoning benchmarks and says the discovered strategies improved the accuracy-cost tradeoff against manually designed baselines. (arxiv.org)

The arXiv abstract says the strategies also generalized to held-out benchmarks and model scales. The GitHub README highlights one headline result: about 69.5% token savings versus SC@64 at beta near 0.5, while held-out average accuracy matched SC@64 across four backbone scales. (arxiv.org)

The README also says a full discovery run cost an estimated $39.9 and took 160 minutes, with zero LLM calls during discovery evaluation because cached segments were replayed offline. (arxiv.org)

### What is the “Confidence Momentum Controller” the repo mentions?

The repository identifies the main discovered policy as the Confidence Momentum Controller, or CMC. It describes that controller as combining trend-based stopping, linked width-depth control, alignment-aware depth allocation and conservative branch abandonment. Those are the paper’s implementation details for how AutoTTS’s search output is turned into an inference policy. (github.com)

The project materials frame that result as a shift from designing single heuristics to designing the environment in which heuristics are discovered. That is the central claim in the paper and repo, not that the base model retrains itself during deployment.

### Where do the NeuOS v4 and “Soul Vectors” claims fit in?
X discussion this week grouped the AutoTTS paper with separate posts about reverse-engineered “NeuOS v4” architecture claims and “Soul Vectors.” The source material provided for this story links those topics through social-media chatter, but no primary technical paper or official documentation could be verified that ties those claims to the AutoTTS authors or the arXiv submission. (github.com)

The AutoTTS paper and repository themselves do not describe “Soul Vectors,” and the “NeuOS” term surfaced in unrelated older academic material in web results, not as a confirmed part of this paper. (arxiv.org)

As of May 21, the verifiable record is the arXiv paper, the public GitHub repository and the project page for AutoTTS. The next visible step is whether the authors release fuller reproduction materials and whether outside researchers confirm the reported accuracy-cost gains on the published benchmarks. (arxiv.org)
