LLMs give flawed advice
- Researchers tested 21 large language models with realistic patient symptoms and found many gave flawed medical guidance.
- The headline figure is alarming: the models failed to produce an appropriate differential diagnosis more than 80% of the time.
- That finding strengthens calls to restrict AI to clerical support and require clinician verification for clinical guidance (el-balad.com).
Large language models can name the right diagnosis after they see the whole chart, but a new study found they still miss the earlier reasoning steps doctors use to get there. (jamanetwork.com)

Doctors do not start with one answer; they build a “differential diagnosis,” a ranked list of plausible causes, then order tests to narrow it down. In a JAMA Network Open study published April 13, 2026, researchers tested 21 off-the-shelf models on 29 standardized clinical vignettes across that full workflow. (jamanetwork.com)

The study used a new benchmark called PrIME-LLM, which scores five parts of clinical reasoning: differential diagnosis, diagnostic testing, final diagnosis, management, and other reasoning tasks. The models produced 16,254 responses in total, and scores ranged from 0.64 for Gemini 1.5 Flash to 0.78 for Grok 4. (jamanetwork.com)

The sharpest gap came at the start of the diagnostic process. Mass General Brigham, which led the study, said the 21 models failed to generate an appropriate differential diagnosis more than 80% of the time, even though all of them reached the correct final diagnosis more than 90% of the time when given all relevant case details. (massgeneralbrigham.org)

That split helps explain why medical chatbot demos can look better than real-world use. Benchmarks built around multiple-choice tests or final-answer accuracy can reward pattern matching after the crucial clues have already been supplied. (jamanetwork.com)

The concern is not only what a model knows, but what it does when information is missing, incomplete, or ambiguous. A separate 2026 medRxiv preprint using 1,000 synthetic headache transcripts found that incomplete histories triggered hazardous recommendations, including triage downgrades in up to 54.8% of life- or sight-threatening emergency cases. (medrxiv.org)

Another 2026 study in npj Digital Medicine looked at patient-posed questions instead of clinician-written vignettes. It evaluated 888 chatbot responses to 222 advice-seeking questions and found problematic-response rates from 21.6% for Claude to 43.2% for Llama, with unsafe-response rates from 5% to 13%. (nature.com)

Researchers at Oxford also tested how people use chatbots when they are trying to make decisions about symptoms. In a randomized trial involving nearly 1,300 online participants, people using large language models did not make better decisions than people using traditional methods such as online search or their own judgment. (ox.ac.uk)

Mass General Brigham said the new findings support a “human in the loop” model, with physicians checking any clinical output before it affects care. Corresponding author Marc Succi said off-the-shelf models are “not ready for unsupervised clinical-grade deployment.” (massgeneralbrigham.org)

The clearest line from these studies is narrow, not broad. Large language models may help with drafting, summarizing, and other clerical work, but the step where symptoms turn into medical judgment still belongs to a clinician. (jamanetwork.com)