AI reimplemented a bioinformatics toolkit
An Epoch AI benchmark showed Claude Opus 4.6 reimplementing gotree, a 16,000-line bioinformatics toolkit, a task Epoch AI estimated would take an unassisted human engineer two to seventeen weeks. The result was presented as an example of how large models can accelerate software engineering work in computational biology. (x.com)
Bioinformatics is the software side of biology: researchers use command-line tools to clean, compare, and analyze DNA and evolutionary data. In a new benchmark result published April 10, Epoch AI said Anthropic’s Claude Opus 4.6 reimplemented gotree, a phylogenetics toolkit, without being given the original source code. (metr.org)

Gotree is a Go-language package for manipulating phylogenetic trees, the branching diagrams biologists use to represent evolutionary relationships. Its GitHub repository describes it as a set of command-line tools and an application programming interface for tree formats including Newick, Nexus, PhyloXML, and Nextstrain. (github.com)

Epoch AI and Model Evaluation and Threat Research, or METR, built the test as part of MirrorCode, a benchmark for software reimplementation. In this setup, the model could run the original program, read high-level documentation, and see visible tests, but it could not read the source code or use the internet. (metr.org, ai-primer.com)

Epoch AI’s preliminary writeup said gotree is about 16,000 lines of Go code with more than 40 commands. The group estimated that rebuilding it from scratch would take an unassisted human engineer between two and seventeen weeks. (ai-primer.com)

The result lands as labs and software companies are trying to measure whether newer models can stay on a coding task for hours or days instead of just fixing short bugs. Anthropic says Claude Opus 4.6, released February 5, 2026, is built for larger codebases and longer-running software tasks, with a one million token context window in beta. (anthropic.com)

MirrorCode is not a normal programming interview. METR said the benchmark measures whether a model can match an existing program’s behavior against an oracle-style specification, and Epoch’s writeup said memorization defenses are still imperfect. (metr.org, ai-primer.com)

That caveat matters for biology software because many research tools are judged by whether they reproduce exact outputs on known file formats and test cases. Gotree itself is designed to slot into automated workflows, where each command writes output that can be piped into the next step. (github.com)

Epoch’s writeup said Claude Opus 4.6 solved almost every target up to gotree’s size in the current suite, while a harder target called Pkl was still improving when the run hit a one billion token limit. The same writeup said the team explored budgets up to about $550 per task in its setup. (ai-primer.com)

METR has separately been tracking “time horizons,” its estimate of how long a human task can be before an artificial intelligence agent’s success rate drops to a chosen level. Its March 3, 2026 update said that metric is based on more than 100 software tasks, which puts the gotree result into a larger push to measure longer autonomous work. (metr.org)

For now, the gotree run is one data point: a frontier model reproduced a real bioinformatics utility closely enough to clear MirrorCode’s tests. The next question is whether the same systems can do comparable work on messier software jobs where the target is not already defined by an existing program. (metr.org, ai-primer.com)
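
For readers who want a concrete picture of what matching an existing program's behavior can look like, here is a minimal Python sketch of oracle-style differential testing: the same inputs are fed to a reference program and to a candidate reimplementation, and a case passes only when exit code and output agree exactly. This is not MirrorCode's actual harness, whose details the writeups above do not spell out. The function names, the toy stand-in programs, and the sample inputs are all hypothetical, chosen only to keep the example self-contained.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CaseResult:
    args: list
    stdin: str
    matched: bool


def run(cmd, args, stdin):
    """Run one program on one input; return (exit code, stdout)."""
    proc = subprocess.run(
        cmd + args, input=stdin, capture_output=True, text=True, timeout=30
    )
    return proc.returncode, proc.stdout


def differential_test(reference_cmd, candidate_cmd, cases):
    """Treat the reference program as the oracle and compare case by case."""
    results = []
    for args, stdin in cases:
        expected = run(reference_cmd, args, stdin)
        actual = run(candidate_cmd, args, stdin)
        results.append(CaseResult(args, stdin, expected == actual))
    return results


def pass_rate(results):
    """Fraction of cases where the candidate matched the oracle exactly."""
    return sum(r.matched for r in results) / len(results)


# Hypothetical stand-ins for a real tool: each "program" uppercases its stdin.
# A real run would point these at the original binary and the reimplementation.
REFERENCE = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
CANDIDATE = [sys.executable, "-c",
             "import sys; print(sys.stdin.read().upper(), end='')"]
# A deliberately wrong candidate that echoes its input unchanged.
BROKEN = [sys.executable, "-c",
          "import sys; sys.stdout.write(sys.stdin.read())"]

# Toy inputs written in Newick-like syntax; assumed for illustration only.
CASES = [([], "(a,b);\n"), ([], "((a,b),c);\n")]

if __name__ == "__main__":
    print("candidate:", pass_rate(differential_test(REFERENCE, CANDIDATE, CASES)))
    print("broken:   ", pass_rate(differential_test(REFERENCE, BROKEN, CASES)))
```

Exact comparison is a strict bar: a reimplementation that prints an equivalent tree with different whitespace would fail a check like this, which is why the point about exact outputs on known file formats matters.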