Motus optimises agents from traces

Motus, an open-source agent infrastructure project, says it continually learns from production traces (failures, latency and cost) to optimise harnesses, models and workflows, and reports higher accuracy at much lower cost. Founder Jia Zhihao shared benchmark results showing roughly 2.3x lower cost and 52% lower latency on Terminal-Bench and SWE-bench compared with frontier models. (x.com/JiaZhihao/status/2044055163975909435)

Artificial intelligence agents are starting to learn from their own production logs, not just from pretraining data, and Motus is pitching that process as open-source infrastructure. (github.com, lithosai.com)

Motus is a new open-source project from LithosAI that serves and deploys agents built with plain Python, the OpenAI Agents Software Development Kit, Anthropic’s Software Development Kit, and Google’s Agent Development Kit. Its GitHub repository says the runtime turns Python code into “parallel, resilient workflows” and includes tools, memory, guardrails, tracing, and cloud deployment commands. (github.com, github.com)

The company says Motus pulls three signals from production traces (task outcomes, latency, and cost) and uses them to change two parts of an agent system: the harness, which is the code and instructions around the model, and the model mix, which decides which model handles each step. LithosAI says those learned changes carry over when developers swap in a new model later. (lithosai.com, langchain.com)

A harness is the scaffolding around a model: the loop that decides when to call tools, what instructions to keep fixed, and how to recover from errors. LangChain’s Harrison Chase wrote on April 5, 2026 that agent systems can improve at the model layer, the harness layer, and the context layer, with traces as the core feedback signal. (langchain.com)

LithosAI’s public site ties that idea to two coding benchmarks. On Terminal-Bench 2.0, it says a setup starting from Claude Opus 4.6 at 64% accuracy rose to 77.5% after harness optimization and to 80.1% after model orchestration, while cutting cost per task to 2.4 times lower than Opus 4.6 alone. (lithosai.com)

On SWE-bench Verified, LithosAI says Claude Opus 4.6 reached 75.8% and GPT-5.3-Codex reached 72.6%, while Motus reached 79% at 2.3 times lower cost than Opus alone. The company also says its setup reduced latency by 52%, a figure shared by founder Jia Zhihao in a post on X. (lithosai.com, x.com)

Those benchmarks measure long, multi-step coding work rather than single-shot code generation. Terminal-Bench describes itself as a benchmark for agents working in terminal environments, and SWE-bench evaluates whether a model can generate a patch that resolves a real GitHub issue inside a Docker-based test setup. (tbench.ai, github.com)

The numbers are company-reported, and benchmark results in agent systems are under heavier scrutiny this year. Researchers and developers have spent the past several months arguing that some agent benchmarks can be gamed through test leakage, weak evaluators, or benchmark-specific tricks rather than genuine task completion. (agent-wars.com, labs.scale.com)

Motus is arriving as developers try to cut the cost of using frontier models for every step of a coding workflow. LithosAI’s pitch is that production traces can tell an agent when to use a cheaper model, when to change the workflow itself, and when the fixed harness is the real bottleneck. (lithosai.com, langchain.com)

For now, the clearest public evidence is the project’s code, its product pages, and the benchmark charts LithosAI has published. The next test is whether outside users can reproduce the gains on their own agents, with their own traces, under public evaluation rules. (github.com, lithosai.com, swebench.com)
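
For readers who want a concrete picture of the "model mix" idea, here is a minimal sketch of how trace signals could in principle drive per-step model choice. It is not Motus code and does not reflect how LithosAI implements routing. The `Trace` schema, the `pick_models` function, the `min_success` threshold, and the sample values are all hypothetical, and the sketch records latency without using it in the decision.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """One recorded agent step (hypothetical schema, not Motus's trace format)."""
    step: str         # workflow step, e.g. "plan" or "test"
    model: str        # model that handled the step
    success: bool     # task outcome signal
    latency_s: float  # latency signal (recorded, unused in this sketch)
    cost_usd: float   # cost signal


def pick_models(traces, min_success=0.9):
    """Per step, pick the cheapest model whose observed success rate meets
    min_success; if none does, fall back to the most successful model."""
    grouped = defaultdict(list)
    for t in traces:
        grouped[(t.step, t.model)].append(t)

    # step -> model -> (success_rate, mean_cost)
    stats = defaultdict(dict)
    for (step, model), rows in grouped.items():
        rate = sum(r.success for r in rows) / len(rows)
        mean_cost = sum(r.cost_usd for r in rows) / len(rows)
        stats[step][model] = (rate, mean_cost)

    routing = {}
    for step, models in stats.items():
        good_enough = {m: s for m, s in models.items() if s[0] >= min_success}
        if good_enough:
            routing[step] = min(good_enough, key=lambda m: good_enough[m][1])
        else:
            routing[step] = max(models, key=lambda m: models[m][0])
    return routing


# Made-up traces for illustration only; these are not benchmark data.
sample_traces = [
    Trace("plan", "large", True, 4.0, 0.10),
    Trace("plan", "small", False, 1.0, 0.01),
    Trace("plan", "small", True, 1.0, 0.01),
    Trace("test", "large", True, 3.0, 0.08),
    Trace("test", "small", True, 0.8, 0.01),
]
routing = pick_models(sample_traces)
print(routing)  # {'plan': 'large', 'test': 'small'}
```

In this toy data the smaller model fails half of its planning steps, so planning stays on the larger model, while the test step moves to the cheaper one. A production system would need far more traces per step and some guard against acting on noisy success rates.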
