Chatbots and medical advice study
A study reported April 13 examined the reliability of medical advice from chatbots, noting growing public use of AI systems for symptom assessment amid primary‑care shortages. The coverage frames health guidance from chatbots as a distinct research area where safety, escalation, and scope need to be specified. (bostonglobe.com)
A Mass General Brigham study found 21 chatbots usually named the right diagnosis only after they were given the kind of full case details patients rarely have. (massgeneralbrigham.org) The researchers reported on April 13 that every model they tested got the final diagnosis right more than 90 percent of the time with complete information. The same systems failed to build an appropriate list of possible diagnoses more than 80 percent of the time when information was incomplete. (massgeneralbrigham.org)
That earlier step is the doctor’s sorting process: turning a few symptoms into a shortlist of plausible causes and deciding which tests or urgent actions come next. The Mass General Brigham team said the models often stumbled there, even when they later landed on the correct answer. (massgeneralbrigham.org)
The group created a benchmark called PrIME-LLM to score chatbots across four stages: possible diagnoses, testing, final diagnosis, and treatment planning. The paper was published in JAMA Network Open. (massgeneralbrigham.org; jamanetwork.com)
The study lands as more patients use chatbots like ChatGPT, Claude, Gemini, Grok, and DeepSeek to interpret symptoms before they reach a clinic. Mass General Brigham said the 21-model comparison included current versions of those systems at the time the paper was submitted. (massgeneralbrigham.org)
A separate University of Oxford-led study published in Nature Medicine in February tested nearly 1,300 people using chatbots for medical scenarios. It found chatbot users did no better than people using search engines or their own judgment to decide what condition they had or what care they needed. (ox.ac.uk)
That Oxford study also found a two-way problem: users often left out details the model needed, and the model responses mixed sound advice with bad advice. The researchers said chatbots sometimes missed when a case needed urgent care. (ox.ac.uk)
Researchers have been trying to measure this field more systematically because many earlier papers were hard to compare. A February 2025 JAMA Network Open review of 137 studies found 99.3 percent tested closed models and most papers did not report basic details such as model version or query date. (jamanetwork.com)
The Lancet Primary Care wrote this year that generative chatbots are already being used to interpret symptoms, lab reports, and next steps in care, even though the systems draw from broad internet data and can change over time. The journal argued that real-world testing should come before routine use in primary care. (thelancet.com)
Mass General Brigham’s corresponding author, Marc Succi, said “off-the-shelf” models are not ready for unsupervised clinical use. For now, the latest studies point to the same boundary: chatbots may assist a clinician, but they still miss too much when patients ask them to play doctor alone. (massgeneralbrigham.org; ox.ac.uk)