Humans still outperform agents
Nature reported that human scientists continue to outperform the best AI agents on complex scientific tasks, underscoring that judgement and experimental design remain human strengths. The finding strengthens the case for hiring researchers who can design experiments, interpret ambiguous results and recognise spurious signals, rather than researchers who can only operate models. (nature.com)
Scientists still beat the best artificial-intelligence agents on hard research problems that require planning experiments, reading messy results and deciding what to try next. (nature.com)
Nature reported on April 13 that the result comes from the 2026 Artificial Intelligence Index, an annual report from Stanford University’s Institute for Human-Centered Artificial Intelligence. The report added a standalone science chapter this year. (nature.com) (hai.stanford.edu)
The underlying test is called AIRS-Bench, short for Artificial Intelligence Research Science Benchmark. It uses 20 tasks drawn from recent machine-learning papers and asks agents to handle the full workflow, including idea generation, experiment analysis and iterative refinement, without baseline code. (arxiv.org)
On those 20 tasks, the agents beat the published human state of the art on four and fell short on 16. The benchmark spans language modeling, mathematics, bioinformatics and time-series forecasting. (arxiv.org)
A research agent is not just a chatbot answering one prompt. It is a system built to take many steps in sequence, use tools, run code and adjust its plan after intermediate results. (nature.com) (arxiv.org)
That matters because labs and companies spent the past year pitching “AI scientists” that could automate more of discovery. Nature has separately reported on projects including Sakana AI’s AI Scientist and FutureHouse’s research agents, both of which are framed as systems for multi-step scientific work rather than simple text generation. (nature.com 1) (nature.com 2)
The new benchmark focuses on a part of science that is harder to fake than a polished answer: deciding which experiment to run, how to interpret a confusing outcome and when a promising-looking signal is probably noise. The AIRS-Bench authors said the tasks are meant to test end-to-end research ability, not just code completion or question answering. (arxiv.org)
The Stanford report places that result inside a wider pattern of mixed progress. It says generative artificial intelligence reached nearly 53% population-level adoption within three years, even as evaluation methods struggle to keep up with what these systems can and cannot reliably do. (hai.stanford.edu)
That leaves a narrower claim than some of the marketing around autonomous science suggested. The tools are spreading fast, but on complex research tasks the people still ahead are the ones choosing the question, designing the test and judging the answer. (nature.com)