ProgramBench finds tool recreation at 0%

- ProgramBench launched on May 5 with 200 cleanroom coding tasks that ask agents to rebuild real tools from binaries and docs alone.
- Across 9 tested models, none fully solved a task; the best passed 95% of tests on just 3% of tasks.
- That matters because SWE-bench-style patching is improving, but whole-program reconstruction still breaks today’s strongest coding agents.

ProgramBench is a new coding benchmark, but it is really a stress test for a very specific kind of software work: rebuilding a program when you can see what it does, not how it was written. The benchmark went live on May 5, alongside an arXiv paper and public code release from researchers behind the broader SWE-bench ecosystem. The headline number is brutal: across 9 evaluated language models, none fully resolved a single task. (arxiv.org)

### What is ProgramBench actually testing?

The setup is cleanroom reconstruction. A model gets a compiled executable and its documentation, then has to design and implement a fresh codebase that behaves the same way. No source lookup. No internet. No shortcut where the model just patches an existing repo. Hidden behavioral tests check whether t[...] (arxiv.org)

### Why is that harder than normal coding benchmarks?

Most coding benchmarks are narrower. HumanEval asks for a function. SWE-bench asks for a patch inside a known repository. ProgramBench asks for architecture, implementation, build setup, and edge-case handling all at once. Basically, the model has to infer the shape of the whole machine from [...] enough to survive fuzzed end-to-end tests. (arxiv.org)

### What did the benchmark include?

The dataset has 200 tasks spanning 6 programming languages: 107 Rust, 46 Go, 33 C, 12 C++, 1 Haskell, and 1 Java. The tasks range from small CLI utilities up to widely used software including FFmpeg, SQLite, and the PHP interpreter. Difficulty is skewed toward the middle, with 120 medium tasks, 27 easy, 18 hard, and 35 unrated. (arxiv.org)

### So what does “0% resolved” mean here?

It means exactly what it sounds like: no evaluated model produced a candidate that fully passed the hidden tests on any task.
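
To make the two headline numbers concrete, here is a minimal sketch of how a strict "resolved" rate and a near-pass rate could be computed from per-task test counts. The `TaskResult` type, the function names, and the sample counts are hypothetical, and reading the 95% figure as "at least 95% of a task's tests passed" is an assumption, not the paper's scoring code.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record: how many hidden tests passed."""
    task_id: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # Guard against an empty test suite.
        return self.passed / self.total if self.total else 0.0


def resolved_rate(results):
    """Fraction of tasks where every hidden test passed."""
    if not results:
        return 0.0
    resolved = sum(r.total > 0 and r.passed == r.total for r in results)
    return resolved / len(results)


def near_pass_rate(results, threshold=0.95):
    """Fraction of tasks whose pass rate is at least `threshold` (assumed reading)."""
    if not results:
        return 0.0
    return sum(r.pass_rate >= threshold for r in results) / len(results)


# Made-up counts, purely for illustration.
results = [
    TaskResult("task-a", 96, 100),
    TaskResult("task-b", 40, 100),
    TaskResult("task-c", 99, 100),
    TaskResult("task-d", 10, 100),
]

print(resolved_rate(results))   # 0.0
print(near_pass_rate(results))  # 0.5
```

With these made-up counts, no task counts as resolved even though half of them clear the 95% bar, which is the same shape as the reported result: close on some tasks, complete on none.
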
The paper’s softer metric shows a little movement: the best model managed to pass 95% of tests on only 3% of tasks. But that is still nowhere near “rebuilt the program.” That g[...] while still failing on the weird cases people depend on. (arxiv.org)

### What kind of failure pattern showed up?

The paper says models tended to write monolithic, single-file implementations that looked nothing like the human-written originals. That is a useful clue. Models can often generate plausible local code, but they still struggle with decomposing a large tool into the right subsystems before they start t[...] convincing demo and rebuilding a watch from the ticking sound. (arxiv.org)

### Why does this matter if other coding scores are rising?

Because it exposes a blind spot in the current benchmark mix. SWE-bench and related tests measure issue resolution inside existing codebases, and those leaderboards have shown real progress. But that does not automatically transfer to closed-box engineering tasks where the source disap[...] itself. ProgramBench is basically asking whether today’s coding agents can replace a legacy internal tool from behavior alone. Right now, the answer is no. (github.com)

### Is this reverse engineering?

Not exactly. The benchmark frames itself as reconstruction from behavior and documentation, not decompilation. The goal is not to recover the original source line by line. The goal is to ship a new implementation that is behaviorally equivalent enough to pass. That makes the result more relevant to software [...] binary analysis. (benchlm.ai)

### Bottom line?

ProgramBench does not say frontier models are bad at coding. It says they are still weak at holistic software recreation, the kind where architecture, exploration, and exact behavioral fidelity matter more than filling in code inside an existing frame. That is a different bar, and right now it is still standing. (arxiv.org)
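
The behavioral-equivalence idea running through the piece can be pictured with a small sketch that runs a reference command and a candidate command on the same inputs and reports where they diverge. This is not the ProgramBench harness: the helper names, the choice to compare only exit code and stdout, and the two stand-in programs (tiny Python one-liners in place of real binaries) are all assumptions for illustration.

```python
import subprocess
import sys


def run(cmd, args, stdin_text=""):
    """Run a command with arguments and stdin; return (exit code, stdout)."""
    proc = subprocess.run(
        [*cmd, *args],
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=10,
    )
    return proc.returncode, proc.stdout


def behavior_mismatches(reference_cmd, candidate_cmd, cases):
    """Return the (args, stdin) cases where the two programs disagree."""
    failures = []
    for args, stdin_text in cases:
        expected = run(reference_cmd, args, stdin_text)
        actual = run(candidate_cmd, args, stdin_text)
        if expected != actual:
            failures.append((args, stdin_text))
    return failures


# Stand-in "executables" (assumption): the reference upper-cases stdin,
# the candidate does the same but also strips surrounding whitespace.
reference = [sys.executable, "-c",
             "import sys; print(sys.stdin.read().upper(), end='')"]
candidate = [sys.executable, "-c",
             "import sys; print(sys.stdin.read().upper().strip(), end='')"]

cases = [
    ([], "abc"),     # common case: both programs agree
    ([], "  abc"),   # edge case: leading whitespace exposes the difference
]

print(behavior_mismatches(reference, candidate, cases))  # [([], '  abc')]
```

The candidate here passes the common case and fails only on the whitespace edge case, a toy version of an implementation that looks right on typical inputs but does not count as equivalent.
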
