AI struggles with diagnosis

Large language models and frontier AI systems often fail at stepwise clinical reasoning and early differential diagnosis, according to recent test results assembled across multiple models. News reports summarising the research said models missed an appropriate early differential diagnosis more than 80% of the time and performed much worse on differential reasoning and test-selection tasks than on final answers.

Doctors are still better than chatbots at the first, messy step of diagnosis: figuring out what might be wrong before all the facts are in. In a new study, 21 large language models missed an appropriate early differential diagnosis more than 80% of the time. (jamanetwork.com) A differential diagnosis is a doctor’s running shortlist of possible causes, built as symptoms come in one by one.

Researchers at Mass General Brigham tested models on 29 stepwise clinical cases from the MSD Manual and found the systems did much better on final answers than on early reasoning. (massgeneralbrigham.org) The paper was published in JAMA Network Open on April 13, 2026. It evaluated off-the-shelf models across five tasks: generating a differential diagnosis, choosing diagnostic tests, making a final diagnosis, assessing severity, and proposing management. (jamanetwork.com)

The gap was stark. Differential diagnosis was the weakest category, with failure rates above 80% across models, while final diagnosis failure rates were below 40%, according to the study and summaries of the results. (news-medical.net)

That difference matters in clinics because patients do not arrive with a finished chart and a confirmed condition. Doctors usually start with incomplete details, update their thinking after each answer or test, and rule out dangerous possibilities before settling on one diagnosis. (euronews.com)

The researchers built a new scoring system called PrIME-LLM to measure that step-by-step process instead of grading only the final answer. Traditional accuracy scores for the same models clustered much closer together, but the new benchmark spread them out by showing where they struggled in sequence. (massgeneralbrigham.org) The models were tested without web search or external tools, and human graders scored each stage of the cases.

In reporting on the study, Medical Xpress said the work was led by Mass General Brigham’s MESH Incubator and focused on whether these systems can support “clinical-grade” use. (medicalxpress.com)

The study does not say the models are useless in medicine. It says they improve when given fuller information, but the authors warned that “off-the-shelf” systems are not ready for unsupervised clinical deployment, especially in early diagnostic reasoning. (technologynetworks.com)

That leaves the same conclusion the paper started with: a chatbot may sound confident at the end of a case, but the weak point is still the beginning, when the right next question or test can change everything. (jamanetwork.com)
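
For readers who want to see why grading each stage can tell a different story than grading only the final answer, here is a minimal Python sketch. It is not the PrIME-LLM scoring method, which this article does not describe in detail. The record layout, the stage names, the `failure_rates` helper, and every value in `GRADES` are assumptions invented for illustration, not data from the study.

```python
from collections import defaultdict

# Hypothetical grader records: (model, case_id, stage, passed).
# Invented toy values for illustration only; NOT results from the study.
GRADES = [
    ("model_a", 1, "differential", False),
    ("model_a", 1, "final_diagnosis", True),
    ("model_a", 2, "differential", False),
    ("model_a", 2, "final_diagnosis", True),
    ("model_b", 1, "differential", False),
    ("model_b", 1, "final_diagnosis", True),
    ("model_b", 2, "differential", True),
    ("model_b", 2, "final_diagnosis", False),
]


def failure_rates(grades):
    """Return the share of failed attempts for each stage, pooled over models and cases."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for _model, _case, stage, passed in grades:
        totals[stage] += 1
        if not passed:
            fails[stage] += 1
    return {stage: fails[stage] / totals[stage] for stage in totals}


if __name__ == "__main__":
    for stage, rate in failure_rates(GRADES).items():
        print(f"{stage}: {rate:.0%} failed")
```

In this toy data the pooled final-diagnosis failure rate looks moderate, while the differential stage fails far more often. A score based only on final answers would hide that kind of pattern, which is the gap the study set out to measure.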
