Harnesses for agent orchestration

A social thread proposes moving agent orchestration into editable natural-language harnesses, shifting control out of opaque code and into human-reviewable artifacts, and reports large gains on multi-step tasks: a +47% success lift in ablation tests on benchmarks such as OSWorld and SWE-bench (x.com). The idea is to make orchestration explicit and versionable so teams can debug, migrate, and audit agent behavior more reliably (x.com).

An agent harness is the control layer around a model, and a March 2026 paper argues that layer can be written in editable natural language instead of buried in code. (arxiv.org) The paper, “Natural-Language Agent Harnesses,” was posted to arXiv on March 26, 2026 by Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, and Hai-Tao Zheng. It pairs those harnesses with an “Intelligent Harness Runtime” that executes the text instructions through contracts, artifacts, and adapters. (arxiv.org)

In plain terms, the model is the brain and the harness is the operating procedure: it decides when to call tools, how to store state, when to retry, and how to hand work to another step. OpenAI’s January 23, 2026 explanation of the Codex “agent loop” and Anthropic’s January 9, 2026 note on evals both describe that orchestration layer as central to how an agent actually works. (openai.com) (anthropic.com)

The change in this paper is where that control logic lives. Instead of hard-coding orchestration rules in a controller, the authors say teams can externalize them into a portable text artifact that people can read, edit, compare, and version. (arxiv.org)

That arrives as major labs are publishing more about harness design than about prompts alone. Anthropic wrote on March 24, 2026 that harness design is “key to performance” for long-running application development, and OpenAI wrote on February 11, 2026 that engineers increasingly spend time designing environments, scaffolding, and feedback loops for agents. (anthropic.com) (openai.com)

The benchmarks in this debate are not toy tasks. OSWorld measures whether multimodal agents can complete 369 open-ended computer tasks across web apps, desktop apps, file operations, and multi-application workflows, while SWE-bench tests whether models can fix real GitHub issues by generating code patches. (os-world.github.io) (swebench.com) SWE-bench Verified narrows that coding test to 500 human-validated issues that annotators reviewed for clarity, correctness, and solvability. The site says the leaderboard includes everything from simple agent loops to more elaborate review and multi-rollout systems, which makes harness design part of the result, not just model quality. (swebench.com)

The paper says it ran controlled evaluations across coding and computer-use benchmarks, including module ablations and migrations from code harnesses to text harnesses. The social posts tied to the paper reported a 47 percent success lift in ablation tests on tasks including OSWorld and SWE-bench, but that figure appears in the posts rather than the arXiv abstract, and the paper is still listed as under review. (arxiv.org) (x.com)

Anthropic’s recent engineering notes point to the same practical problems these harnesses are trying to solve: models lose coherence on long tasks, context windows fill up, and agents need structured handoffs between sessions. Its March 24 post says a three-agent setup of planner, generator, and evaluator improved multi-hour coding sessions by breaking work into chunks and carrying forward structured artifacts. (anthropic.com)

OpenAI’s Codex posts frame the problem similarly from the software side. The company says the core agent loop orchestrates user input, model calls, tool execution, and repeated retries, and that a small team used Codex to generate roughly a million lines of code over about five months while humans focused on system design rather than manual coding. (openai.com 1) (openai.com 2)

The immediate test for natural-language harnesses is whether teams trust text files to carry rules that were once hidden in code. The paper’s bet is that making orchestration visible and editable turns agent behavior from a black box into something closer to a spec. (arxiv.org)
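
For readers who want to see what "control layer" means in practice, the sketch below shows a code-based harness of the kind the article describes: a loop that takes user input, calls a model, runs tools, stores state, and retries on failure. It is an illustration only, not the paper's runtime and not OpenAI's or Anthropic's implementation. Every name in it (HarnessState, run_harness, scripted_model, add_tool) is hypothetical, and the "model" is a scripted stand-in so the example runs without any API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class HarnessState:
    """State the harness carries between steps (hypothetical structure)."""
    history: list[str] = field(default_factory=list)
    retries_used: int = 0


def run_harness(
    model: Callable[[list[str]], dict],
    tools: dict[str, Callable[[str], str]],
    task: str,
    max_steps: int = 5,
    max_retries: int = 2,
) -> tuple[str, HarnessState]:
    """Minimal agent loop: model call, tool execution, retry, state tracking."""
    state = HarnessState(history=[f"user: {task}"])
    for _ in range(max_steps):
        action = model(state.history)
        if "final" in action:
            # The model decided it is done; hand back the answer and the state.
            return action["final"], state
        try:
            result = tools[action["tool"]](action["arg"])
        except Exception as exc:
            # Retry policy lives in the harness, not in the model.
            state.retries_used += 1
            if state.retries_used > max_retries:
                raise RuntimeError("retry budget exhausted") from exc
            state.history.append(f"error: {exc}")
            continue
        state.history.append(f"tool {action['tool']}: {result}")
    raise RuntimeError("step budget exhausted")


def scripted_model(history: list[str]) -> dict:
    """Stand-in for a real model: sends a bad tool call first, then recovers."""
    last = history[-1]
    if last.startswith("user:"):
        return {"tool": "add", "arg": "2,x"}  # malformed on purpose
    if last.startswith("error:"):
        return {"tool": "add", "arg": "2,3"}
    return {"final": last.split(": ", 1)[1]}


def add_tool(arg: str) -> str:
    """Toy tool: adds two comma-separated integers."""
    a, b = arg.split(",")
    return str(int(a) + int(b))


if __name__ == "__main__":
    answer, state = run_harness(scripted_model, {"add": add_tool}, "add 2 and 3")
    print(answer, state.retries_used)
```

In this sketch the step limit, the retry budget, and the shape of the history are all fixed in Python. Those are the kinds of rules the paper proposes moving into a readable, versionable text artifact.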

