Claude Opus 4.6 reimplements toolkit

Epoch AI's MirrorCode benchmark showed Claude Opus 4.6 reimplementing a 16,000-line bioinformatics toolkit, a task its authors estimate would take an unassisted human engineer 2 to 17 weeks. The result shows that a frontier model can rebuild a sizable codebase in a specialized technical domain from its observable behavior alone. (x.com)

Most coding benchmarks ask a model to patch one bug or write one function. MirrorCode asks for something closer to rebuilding a whole machine from its outputs: the model can run the original program, read docs, and see tests, but it cannot read the source code or use the internet. (metr.org) That setup is called software reimplementation. It is like handing someone a blender, a recipe book, and a taste test, then asking them to build a new blender that behaves the same way. (metr.org)

The program in this case was gotree, a bioinformatics command-line toolkit written in the Go programming language. Gotree is used to manipulate phylogenetic trees, which are the branching diagrams biologists use to represent evolutionary relationships. (github.com, academic.oup.com) Gotree is not a toy. The Gotree and Goalign toolkit paper says the package implements more than 120 commands for sequence alignments and phylogenetic tree operations, and Epoch’s benchmark run focused on a gotree target with about 16,000 lines of code and more than 40 commands. (academic.oup.com, ai-primer.com)

Epoch AI and Model Evaluation and Threat Research (METR) released preliminary results on April 10, 2026, saying Claude Opus 4.6 fully rebuilt that gotree target. Their estimate was that the same task would take an unassisted human engineer between 2 and 17 weeks. (metr.org, ai-primer.com)

The benchmark is built to stop easy shortcuts. The model runs in a Docker sandbox, gets visible tests plus held-out tests, is blocked from wrapping the original binary, and is cut off from the web. (ai-primer.com, github.com)

Claude Opus 4.6 was already marketed by Anthropic as a model that can sustain long coding tasks and work inside larger codebases. Anthropic’s February 5, 2026 launch post said the model plans more carefully, debugs its own mistakes better, and can handle a 1-million-token context window in beta. (anthropic.com)

MirrorCode is a stronger claim than a normal coding demo because it tests behavior, not style. The model does not need to copy the original code line by line; it needs to produce a new codebase that passes the same behavioral checks as the original tool. (metr.org, ai-primer.com)

There are still caveats. Epoch says MirrorCode uses an oracle-style specification, memorization defenses are imperfect, and the hardest target in the current suite, a language tool called Pkl, was still unfinished when a run hit a 1-billion-token budget. (ai-primer.com, github.com)

So this does not mean a model can replace a software team on any random project. It does mean at least one frontier model can now reconstruct a specialized technical tool, at meaningful size, from black-box access alone. (metr.org, anthropic.com)

The part to watch next is breadth. Epoch says the full MirrorCode benchmark has more than 20 targets spanning Unix utilities, data tools, interpreters, cryptography, compression, and bioinformatics, with a private test set held back for future evaluation. (ai-primer.com, github.com)
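
For readers who want a concrete picture of what "passes the same behavioral checks" can mean, here is a minimal sketch of differential testing: run the original program and the rewrite on the same inputs and compare what comes out. This is an illustration only, not Epoch's or METR's actual harness. The function names (`run`, `compare`, `pass_rate`) and the stand-in commands are hypothetical, and tiny Python one-liners play the role of the original tool and its reimplementations.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CaseResult:
    args: list
    passed: bool


def run(cmd, args, stdin_text):
    """Run one command and return the behavior we compare: exit code and stdout."""
    proc = subprocess.run(
        cmd + args, input=stdin_text, capture_output=True, text=True, timeout=30
    )
    return proc.returncode, proc.stdout


def compare(reference_cmd, candidate_cmd, cases):
    """Feed identical inputs to both programs and record whether outputs match."""
    results = []
    for args, stdin_text in cases:
        expected = run(reference_cmd, args, stdin_text)
        actual = run(candidate_cmd, args, stdin_text)
        results.append(CaseResult(args, expected == actual))
    return results


def pass_rate(results):
    """Fraction of cases where the candidate matched the reference."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


# Hypothetical stand-ins: the "original tool" uppercases its input.
REFERENCE = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
# A faithful rewrite: different code, same observable behavior.
GOOD = [sys.executable, "-c",
        "import sys; print(sys.stdin.read().upper(), end='')"]
# A flawed rewrite: echoes input unchanged.
BAD = [sys.executable, "-c",
       "import sys; sys.stdout.write(sys.stdin.read())"]

# Each case is (command-line arguments, stdin text).
CASES = [([], "abc\n"), ([], "ABC\n")]

if __name__ == "__main__":
    print("good rewrite pass rate:", pass_rate(compare(REFERENCE, GOOD, CASES)))
    print("bad rewrite pass rate:", pass_rate(compare(REFERENCE, BAD, CASES)))
```

Note that the flawed rewrite still matches on the input that is already uppercase. That is the reason a benchmark of this kind keeps held-out tests in addition to the visible ones: a candidate that only looks right on the cases it has seen gets caught on the ones it has not.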
