Study finds AI models often miss the early diagnosis
A recent study reports that language models fail to produce an appropriate early patient diagnosis more than 80% of the time, underscoring limits in unsupervised clinical triage. The findings emphasize that conversational fluency does not equal reliable medical reasoning, with implications for any high-stakes public service that uses AI to advise users. (euronews.com)
Doctors often start with a short list of possible causes, not one answer. A new JAMA Network Open study found 21 large language models missed that early step more than 80% of the time. (jamanetwork.com)
The study was published April 13, 2026, by researchers at Mass General Brigham. They tested 21 off-the-shelf models on 29 standardized clinical vignettes from the January 2025 MSD Manual update, producing 16,254 scored responses. (massgeneralbrigham.org, jamanetwork.com, msdmanuals.com)
The researchers walked each model through a real clinic-style sequence: suggest possible diagnoses, choose tests, name a final diagnosis, and propose treatment. Real-time web search and other add-on tools were turned off during testing, and medical students scored the answers against answer keys. (beckershospitalreview.com, jamanetwork.com)
That first step is called differential diagnosis. It is the part where a clinician keeps several explanations in play while the facts are still incomplete, and the study found it was the weakest category across all 21 models. (euronews.com, jamanetwork.com)
The models looked better later in the encounter, after they were given exam findings and lab results. Failure rates on final diagnosis were below 40% across all models and fell as low as 9% for the best performers when more information was available. (beckershospitalreview.com, euronews.com)
The top overall score in the paper went to Grok 4, with a PrIME-LLM score of 0.78, while Gemini 1.5 Flash scored 0.64. The paper said reasoning-optimized models outperformed nonreasoning models, and GPT models scored highest overall as a family. (jamanetwork.com)
The authors said the gap is not about memorizing medical facts. Marc Succi of Mass General Brigham said the systems still lack the reasoning needed for safe frontline use, while lead author Arya Rao said they struggle most at the open-ended start of a case, before the record is complete. (massgeneralbrigham.org, euronews.com)
Hospitals are already being warned about that risk. ECRI, a patient-safety nonprofit, listed balancing the benefits and risks of artificial intelligence in clinical diagnosis as its No. 1 patient safety concern for 2026. (home.ecri.org)
Mass General Brigham’s own conclusion was narrower than a blanket rejection. The health system said the most responsible use right now is targeted, clinician-supervised deployment in low-uncertainty tasks, not unsupervised patient-facing diagnosis. (massgeneralbrigham.org, beckershospitalreview.com)