Anthropic's Claude solves 30%

- Anthropic said on April 29 that Claude’s latest research models cleared a new 99-question bioinformatics benchmark built from real biological datasets.
- The standout result was on 23 “human-difficult” problems: Claude Mythos Preview solved about 30%, while panels of up to five experts could not.
- That matters because BioMysteryBench tests actual data analysis, not textbook recall, which is closer to how genomics and lab-computation work. (anthropic.com)

Bioinformatics is the part of biology where the work happens inside messy datasets instead of clean textbook diagrams. That matters because modern biology runs on sequencing files, expression matrices, protein measurements, and metadata that rarely line up neatly. The gap has been obvious for a while: language models can ace exams, but real biological analysis is noisy, tool-heavy, and full of dead ends. On April 29, Anthropic said its new Claude results on BioMysteryBench suggest that gap is starting to close, at least a little.

### What is BioMysteryBench?

It’s Anthropic’s new benchmark for bioinformatics research work. Not multiple choice. Not trivia. It uses 99 expert-written questions built around real biological datasets (things like DNA and RNA sequencing, proteomics, metabolomics, and related metadata) and asks for a verifiable answer, not a preferred method. The point is to test whether a model can get through the same kind of ugly, open-ended analysis a scientist actually faces.

### Why is that harder than normal AI benchmarks?

Most well-known benchmarks mainly test knowledge and reasoning in a chat window. Biology research usually needs more than that. A model has to inspect data files, use databases, run tools, write code, and decide what to try next when the first idea fails. Basically, this is closer to handing someone a strange lab notebook and a hard drive than asking them an exam question.

Anthropic split the 99 problems into two rough buckets. On 76 problems that human experts could solve, recent Claude generations performed around expert level. The attention-grabber was the other 23 problems, cases that stumped panels of up to five domain experts. On those, Claude Mythos Preview solved roughly 30%. That is the number people are reacting to.

### Why is the 30% number a big deal?

Because the comparison is not “Claude versus an average student.” It is Claude versus specialists looking at real data. And these were not cherry-picked toy tasks where the answer was sitting in the prompt. The benchmark was designed so answers could be checked against objective properties of the data or outside validation, like known metadata or assay-confirmed results. So when the model gets one right, the claim is supposed to cash out in the data, not in vibes.

### Does this mean Claude is better than biologists?

No. That is the wrong read. The result means models are becoming useful on slices of biological analysis that used to look firmly expert-only. But the benchmark itself is narrow. It measures answer quality on curated problems, not whether a model can run an actual research program, notice contamination in a wet lab, or tell when a dataset is broken in some novel way. Anthropic also framed the result as rapid improvement across generations, not as “AI solved biology.”

### Why bioinformatics first?

Because biology has quietly become a data discipline. Sequencers, microscopes, and mass spec machines all dump out huge files that need interpretation before a human can even decide what happened. That makes the field unusually compatible with AI systems that are good at reading, coding, searching, and stitching together clues from many tools at once. If a model gets genuinely reliable here, it could become a force multiplier for small labs.

### What’s the catch?

Reproducibility and trust. A model that solves 30% of expert-stumping tasks is interesting. A model that solves them consistently, explains itself clearly, and fails safely is useful. Those are not the same thing. In science, one flashy hit matters less than whether another lab can rerun the workflow and get the same answer. That’s the bar these systems still have to clear.

For computational biologists, the takeaway is that a frontier model just posted a real number on a benchmark built to look more like actual biological research, and that number was high enough to get scientists’ attention.
If that trend keeps moving, the near-term change is not autonomous science. It’s better biological copilots.
