LLMs fail with missing data

A study published in JAMA Network Open and reported by the Financial Times found that leading large language models make major diagnostic errors when clinical information is incomplete, with error rates above 80% on the tests reviewed. The study tested 21 LLMs and reported error rates falling to under 40% only when full clinical details were provided, raising clear questions about liability for decisions made from partial chatbot output (x.com). Several posts sharing the analysis also warned courts and practitioners that early legal cases may hinge on whether clinicians relied on incomplete AI responses when making care decisions (x.com).

Doctors do not diagnose from one clue, and the newest medical chatbots still break down when they have to. A JAMA Network Open study found 21 large language models missed appropriate early diagnoses in more than 80% of cases when key clinical details were still missing. (jamanetwork.com)

The researchers tested 21 off-the-shelf models on 29 standardized clinical vignettes from the January 2025 update of the MSD Manual, scoring 16,254 responses across five parts of the workflow: differential diagnosis, diagnostic testing, final diagnosis, management, and other reasoning tasks. Analyses ran from January through December 2025. (jamanetwork.com)

A differential diagnosis is the doctor’s running shortlist before the answer is obvious, like narrowing a mystery before the last chapter. That was the weakest step for the models, while final diagnosis and management scored much better once the full case was available. (massgeneralbrigham.org, jamanetwork.com)

Mass General Brigham, which led the study, said every model reached the correct final diagnosis more than 90% of the time when researchers supplied all pertinent case information. The same systems still struggled to build a testable early list of possibilities when the presentation was incomplete. (massgeneralbrigham.org)

That gap cuts directly into how clinicians actually work, because patients do not arrive with every lab, symptom, and history detail neatly assembled. The JAMA paper says common benchmark tests can overstate performance by handing models the whole case up front instead of forcing step-by-step reasoning. (jamanetwork.com)

The top-scoring model in the study was Grok 4 with a PrIME-LLM score of 0.78, while Gemini 1.5 Flash scored 0.64, and the authors said reasoning-optimized models outperformed nonreasoning models overall. Even so, the paper’s key finding was that early diagnostic reasoning remained the weakest domain across the field. (jamanetwork.com)

Marc Succi of Mass General Brigham said off-the-shelf models are “not ready for unsupervised clinical-grade deployment,” and framed the safer role for these tools as support for physicians rather than replacement. The group built a new benchmark, called the Proportional Index of Medical Evaluation for Large Language Models, to expose uneven performance that average accuracy scores can hide. (massgeneralbrigham.org, jamanetwork.com)

The regulatory backdrop is shifting at the same time. The Food and Drug Administration’s clinical decision support guidance, updated in January 2026, says software aimed at health professionals must let them independently review the basis for recommendations if it is to fit the agency’s non-device category. (fda.gov, fda.gov)

The Financial Times reported on March 29, 2026 that the findings sharpen liability questions for hospitals, vendors, and clinicians if care decisions are influenced by partial chatbot output. Early court fights are likely to turn on ordinary negligence questions: what the tool showed, what information was missing, and whether a clinician treated the output as advice or as a substitute for judgment. (ft.com, medicaleconomics.com)

The study does not say language models are useless in medicine; it says they look strongest at the end of the case, after much of the hard reasoning has already been done. That is the same point the opening result makes: when the facts are incomplete, the chatbot is most likely to fail. (massgeneralbrigham.org, jamanetwork.com)
