Claude Opus 4.5 matches expert peer review
- FrontierNewsAI on May 21 pointed to a new arXiv study saying Claude Opus 4.5 matched strong human scientific reviewers on expert-rated critiques.
- The study had 45 scientists spend 469 hours rating 2,960 criticisms from reviews of 82 Nature-family papers across three research areas.
- The paper, submitted to arXiv on May 20, 2026, compares Claude Opus 4.5, GPT-5.2 and Gemini 3.0 Pro.

A new arXiv paper submitted on May 20 says AI systems can perform at or near the level of human scientific reviewers on a large expert-rated benchmark of peer-review critiques. The study examined 2,960 individual criticisms drawn from reviews of 82 Nature-family papers and had 45 domain scientists spend 469 hours rating them for correctness, significance and sufficiency of evidence.

FrontierNewsAI highlighted the result on May 21, focusing on Claude Opus 4.5 as one of the models in the comparison. The paper’s abstract says all three AI reviewers tested, including Claude Opus 4.5, outperformed the lowest-rated human reviewer across every dimension, while a GPT-5.2-based reviewer scored above each paper’s top-rated human reviewer on a composite measure.

### Where did the “2,960 criticisms in 469 hours” claim come from?

The numbers come directly from the paper “On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists,” posted to arXiv on May 20. The abstract says 45 scientists in physical, biological and health sciences rated 2,960 individual criticisms over 469 hours. Those criticisms were extracted from both human-written and AI-generated reviews of 82 papers. (arxiv.org)

The paper frames each criticism as a review comment targeting one specific aspect of a manuscript rather than treating a whole review as a single unit. That design matters because it measures whether a critique is correct, important and supported, not just whether an overall recommendation matches a journal decision. (arxiv.org)

### Did the study actually say Claude Opus 4.5 “matched” top experts?

The abstract does not say Claude Opus 4.5 alone beat the best human reviewer overall. It says a GPT-5.2-powered reviewing agent scored above each paper’s top-rated human reviewer on the composite metric, at 60.0% versus 48.2%, with a reported p-value of 0.009.
It also says all three AI reviewers tested, including Gemini 3.0 Pro and Claude Opus 4.5, exceeded the lowest-rated human reviewer across every dimension.

FrontierNewsAI’s phrasing that Claude Opus 4.5 “matched top human experts” is broader than the wording visible in the abstract. Based on the paper text available through arXiv search results, the verified claim is that Claude Opus 4.5 was one of three AI reviewers in the study and that the group of AI reviewers cleared the lowest-rated human benchmark on all measured dimensions. (arxiv.org)

### What else did the researchers find about AI peer review?

The abstract says AI reviewers surfaced a distinct 26% of issues that no human reviewer raised. It also says AI reviewers’ accurate criticisms were more likely to be rated significant and well-evidenced. At the same time, the paper reports that AI reviewers overlapped with one another much more than humans did (21% versus 3% for cross-reviewer pairs) and showed 16 recurring weaknesses not shared by humans. (arxiv.org)

Those weaknesses included limited subfield knowledge, difficulty managing long context across multiple files, and what the abstract calls an overly critical stance on minor issues. The paper presents the results as evidence of both capability and limits rather than a simple replacement case for human referees. (arxiv.org)

### What is Claude Opus 4.5 in this comparison?

Anthropic introduced Claude Opus 4.5 on Nov. 24, 2025, describing it as its newest model and saying it was available through its apps, API and major cloud platforms. Anthropic’s launch post describes Opus 4.5 as a model for coding, agents and computer use, and says it is also stronger on research-style tasks. (anthropic.com) In the new peer-review paper, Claude Opus 4.5 appears as one of the evaluated AI reviewers rather than the sole headline winner.

The next place to look is the full paper on arXiv, which was submitted on May 20 and lists Seungone Kim, Dongkeun Yoon and Graham Neubig among its authors. (arxiv.org)
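
For readers who want a sense of scale, the abstract's headline figures can be turned into rough averages. The short Python sketch below is illustrative only: the variable names are ours, and the per-criticism and per-scientist averages assume the 469 hours were spread evenly and that each criticism was rated once, neither of which the abstract states.

```python
# Figures quoted from the paper's abstract, as reported in this article.
scientists = 45
rating_hours = 469
criticisms = 2960
papers = 82

# Derived averages. ASSUMPTION: effort was spread evenly and each criticism
# was rated once; the abstract does not confirm either point.
minutes_per_criticism = rating_hours * 60 / criticisms
criticisms_per_paper = criticisms / papers
hours_per_scientist = rating_hours / scientists

# Composite scores reported for the GPT-5.2 agent and for each paper's
# top-rated human reviewer; the gap is in percentage points.
composite_gap_points = 60.0 - 48.2

print(f"{minutes_per_criticism:.1f} minutes per criticism")
print(f"{criticisms_per_paper:.1f} criticisms per paper")
print(f"{hours_per_scientist:.1f} hours per scientist")
print(f"{composite_gap_points:.1f}-point composite gap")
```

Under those assumptions, the figures work out to roughly 9.5 minutes of expert attention per criticism, about 36 criticisms per paper and a little over 10 hours per scientist. These are back-of-the-envelope numbers, not results from the paper.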