Meta‑Harness token cut

A public post says Meta‑Harness auto‑optimizes LLM scaffolding and outperformed baselines by 7.7 points while using four times fewer tokens, positioning the approach as a token‑efficiency improvement for scaffolded prompts (x.com). The write‑up emphasizes fewer tokens and higher scores across the evaluated tasks, suggesting practical cost savings for multi‑step LLM workflows (x.com).
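To make the cost implication concrete, here is a minimal arithmetic sketch. The context size, call volume, and price per million tokens are hypothetical placeholders, not figures from the post or the paper, and the calculation covers only context (input) tokens, since that is the quantity the reported reduction applies to.

```python
def context_cost(tokens_per_call: int, calls: int, usd_per_million_tokens: float) -> float:
    """Input-side cost in USD for a workload. All inputs are hypothetical."""
    return tokens_per_call * calls * usd_per_million_tokens / 1_000_000


baseline_tokens = 8_000                 # hypothetical context tokens per call
reduced_tokens = baseline_tokens // 4   # "four times fewer" context tokens
calls = 100_000                         # hypothetical number of calls
price = 2.0                             # hypothetical USD per million input tokens

baseline = context_cost(baseline_tokens, calls, price)
reduced = context_cost(reduced_tokens, calls, price)
savings = 1 - reduced / baseline

print(f"baseline ${baseline:.2f}, reduced ${reduced:.2f}, savings {savings:.0%}")
```

Under these assumptions the context bill falls by three quarters regardless of the price chosen, because the price cancels out of the ratio. Output tokens and any cost of running the optimizer itself are not modeled.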

Large language models do not work alone; the surrounding code decides what the model sees, remembers, and retrieves. A new Stanford-led paper says its “Meta-Harness” system can rewrite that scaffolding to score higher while using fewer tokens. (arxiv.org)

The paper, submitted to arXiv on March 30, 2026, is by Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. It defines a “harness” as the code that determines what information to store, retrieve, and present to a large language model. (arxiv.org)

Meta-Harness searches over that harness code in an outer loop: it proposes a new version, evaluates it, stores the source code and execution traces, and tries again. The authors say the proposer reads prior candidates through a filesystem instead of relying on short summaries or a single score. (arxiv.org; yoonholee.com)

In plain terms, the system is tuning the workflow around the model rather than retraining the model itself. That workflow includes prompt construction, context management, retrieval, and tool use, which are often still adjusted by hand in production systems. (arxiv.org; github.com)

The headline result came on online text classification, where the paper says Meta-Harness beat a state-of-the-art context-management baseline by 7.7 points while using 4 times fewer context tokens. In retrieval-augmented math reasoning, the paper reports a 4.7-point average accuracy gain on 200 International Mathematical Olympiad-level problems across five held-out models. (arxiv.org)

The paper also reports gains in agentic coding, where discovered harnesses outperformed the best hand-engineered baselines on TerminalBench-2. On the project page, the authors show one illustrative search run rising from 28.5% to 46.5% on a hard 19-task subset by iteration 7. (arxiv.org; yoonholee.com)

The token claim is central because tokens are the units that models read and that usage is billed in. If a harness can deliver better results with less context, the same workflow can become cheaper to run and easier to fit inside context limits. (arxiv.org)

The authors argue that prior optimizers often hide too much information by compressing feedback into summaries. Their project page says Meta-Harness instead gives the coding agent access to raw logs, source code, and scores from all prior runs, with up to 10 million tokens of diagnostic context per step, compared with at most 26,000 for surveyed earlier methods. (yoonholee.com)

That design choice points at a broader shift in artificial intelligence tooling: performance gains are coming not only from bigger base models, but from better orchestration around fixed models. The paper’s premise is that two systems using the same model weights can perform differently because the surrounding harness makes different decisions about memory, retrieval, and tool calls. (arxiv.org)

The code release landed publicly on GitHub this week as a framework with two reference experiments from the paper: text classification and Terminal-Bench 2.0. The repository says it is a reusable framework for applying Meta-Harness to new domains, which will let outside researchers test whether the reported gains hold up beyond the paper’s benchmarks. (github.com)

For now, the results are a research claim tied to a fresh arXiv paper and an initial code release. But the basic pitch is straightforward: spend less context, inspect more traces, and let the scaffold around the model improve itself. (arxiv.org; github.com)
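The propose, evaluate, store, and retry loop described above can be sketched in a few lines. This is a toy illustration under loud assumptions, not the authors' implementation: the "harness" is reduced to a single hypothetical setting (`context_items`), the evaluator is a made-up scoring function, and the proposer is a random perturbation standing in for the coding agent the paper describes. The one feature it tries to mirror is that every candidate and its trace are written to disk, and the proposer reads that full history rather than a single summary score.

```python
import json
import random
import tempfile
from pathlib import Path


def evaluate(harness: dict) -> dict:
    """Toy evaluator (hypothetical): the score peaks when 8 items are kept in context."""
    k = harness["context_items"]
    score = 100 - 5 * abs(k - 8)
    trace = [f"kept {k} context items", f"score {score}"]
    return {"score": score, "context_tokens": 50 * k, "trace": trace}


def read_history(archive: Path) -> list[dict]:
    """Load every prior candidate (harness, result, trace) from the filesystem."""
    return [json.loads(p.read_text()) for p in sorted(archive.glob("candidate_*.json"))]


def propose(history: list[dict], rng: random.Random) -> dict:
    """Stand-in proposer: perturb the best candidate seen so far."""
    if not history:
        return {"context_items": 32}  # arbitrary starting harness
    best = max(history, key=lambda c: c["result"]["score"])
    step = rng.choice([-4, -2, -1, 1, 2])
    return {"context_items": max(1, best["harness"]["context_items"] + step)}


def search(archive: Path, iterations: int, seed: int = 0) -> dict:
    """Outer loop: propose, evaluate, persist the record, repeat."""
    rng = random.Random(seed)
    for i in range(iterations):
        harness = propose(read_history(archive), rng)
        result = evaluate(harness)
        record = {"iteration": i, "harness": harness, "result": result}
        (archive / f"candidate_{i:03d}.json").write_text(json.dumps(record, indent=2))
    return max(read_history(archive), key=lambda c: c["result"]["score"])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        best = search(Path(tmp), iterations=30)
        print(best["harness"], best["result"]["score"])
```

In a real setting the proposer would be a model or agent that inspects the stored source and traces before writing new harness code; the random step here only keeps the sketch self-contained and runnable.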
