Data is the bottleneck

Stanford’s 2026 AI Index flagged data as the new bottleneck in biology and medicine, saying scaling high‑quality datasets is now a central constraint for progress. That theme was echoed by posts stressing rigorous data‑quality checks and improved relevance and coverage for supply‑chain and clinical datasets. (x.com/MatrixAINetwork/status/2043734520420282542) (x.com/fiscal_ai/status/2043782851947081926) (x.com/SocketSecurity/status/2043743687184699777)

Artificial intelligence in biology and medicine is running into a simpler problem than model design: there is not enough high-quality data to train and test it well. (hai.stanford.edu)

Stanford’s 2026 Artificial Intelligence Index, released April 13, split out standalone chapters on science and medicine for the first time and said the “data infrastructure needed to track AI’s impact” is struggling to keep pace. The report said the cost of incomplete data is rising as AI moves deeper into clinics and laboratories. (hai.stanford.edu)

In the medicine chapter, Stanford said biological model development is “increasingly bottlenecked on data rather than architecture.” It pointed to a 2025 shift toward distilled datasets of AI-predicted structures and combined experimental sources that expanded training sets from hundreds of thousands of entries to tens of millions. (hai.stanford.edu)

The science chapter shows why bigger models alone are not enough. On PaperArena, the best AI agent reached 38.8% accuracy on end-to-end research tasks versus an 83.5% baseline for PhD-level experts, and on BixBench, frontier models scored about 17% on real-world bioinformatics analysis. (hai.stanford.edu)

Medical data is harder to scale than internet text because hospitals must remove identifying details, follow privacy rules, and capture years of patient history across many visits. Stanford researchers wrote in February 2025 that widely used benchmarks such as Medical Information Mart for Intensive Care, or MIMIC, do not contain full longitudinal trajectories for chronic disease management and multi-visit care. (hai.stanford.edu)

That gap shows up in the numbers. Stanford said electronic health record data appeared in only 5% of studies evaluating healthcare uses of large language models, and its replacement benchmark set (EHRSHOT, INSPECT, and MedAlign) covers 25,991 patients, 441,680 visits, and 295 million clinical events. (hai.stanford.edu)

The same report shows deployment is moving faster than evidence. Stanford Medicine said in January 2026 that more than 1,200 AI-enabled medical tools had already been cleared by the Food and Drug Administration, even as many claims of physician-level performance still come from narrow benchmarks or controlled tests. (med.stanford.edu)

Stanford’s AI Index also found that 258 AI medical devices were authorized in 2025, mostly through modification pathways that did not require new clinical trials. Only 2.4% of devices with clinical studies were backed by randomized trial data. (hai.stanford.edu)

Some parts of science have moved faster because the data already exists at scale. Stanford said earth science draws on government and academic archives, astronomy released a 100-terabyte training dataset in 2025, and weather models reached operational use after decades of satellite and reanalysis records. (hai.stanford.edu)

The pattern is now clear across fields: where researchers have broad, well-labeled, representative datasets, AI systems move into real workflows sooner; where data is sparse, siloed, or hard to verify, progress slows. Stanford’s 2026 report says that constraint now sits at the center of biology and medicine. (hai.stanford.edu)
