AI chatbots give risky health advice

A British Medical Journal–linked study reported that AI chatbots provided incomplete, misleading or risky medical advice in roughly half of test cases. A separate Medscape report found hallucination rates for fake medical references varied widely across nine chatbots, from 0% to 34%. (businessday.co.za) (medscape.com)

Artificial intelligence chatbots are giving users unsafe or misleading health advice often enough that researchers are now measuring the failures, not just debating the risk. (bloomberg.com)

A BMJ Open study published this week tested five platforms (ChatGPT, Gemini, Meta AI, Grok and DeepSeek) with 10 questions across five health topics and found about half of all answers were problematic. Nearly 20% were rated highly problematic. (bloomberg.com)

A separate Medscape report published April 14 said nine chatbots produced fake medical references at rates ranging from 0% to 34%. Medscape said Perplexity Research had the lowest hallucination rate and Grok 3 the highest. (medscape.com)

These systems work by predicting the next likely word from patterns in training data, which can make an answer sound fluent before it is actually checked against clinical evidence. In medicine, that can turn a confident sentence into bad triage, a fake citation or advice that sends a patient to the wrong level of care. (nature.com)

The clearest recent warning came from a Nature Medicine study of ChatGPT Health, OpenAI’s consumer health tool launched on January 7, 2026. Researchers stress-tested 60 clinician-written cases across 21 medical areas and found the most dangerous failures clustered at the edges: 35% of nonurgent presentations and 48% of emergency conditions. (nature.com)

Among gold-standard emergencies, ChatGPT Health undertriaged 52% of cases, sometimes telling users with diabetic ketoacidosis or impending respiratory failure to seek care within 24 to 48 hours instead of going to the emergency department. The same study found suicide-crisis messages appeared unpredictably. (nature.com)

Another physician-led study in npj Digital Medicine, published February 13, 2026, tested four public chatbots on 222 patient-style primary care questions and found problematic-response rates from 21.6% for Claude to 43.2% for Llama. Unsafe-response rates ranged from 5% for Claude to 13% for GPT-4o and Llama. (nature.com)

The audience for these tools is already large. OpenAI said in January that more than 230 million people globally ask health and wellness questions on ChatGPT each week, and a West Health-Gallup poll released April 15 found roughly one-quarter of United States adults had used an artificial intelligence tool for health information or advice in the previous 30 days. (openai.com) (abcnews.com)

OpenAI says ChatGPT Health is meant to support, not replace, medical care, and is not intended for diagnosis or treatment. The company also says Health chats are kept in a separate space and are not used to train its foundation models. (openai.com)

Medical journals are now building rules for how these systems should be evaluated. The Chatbot Assessment Reporting Tool, published by BMJ and BMJ Medicine in 2025, set out a checklist for studies that test chatbots on health advice and evidence summaries. (bmj.com) (bmjmedicine.bmj.com)

The pattern across the new studies is simple: the answers can sound polished even when the medicine is shaky. As more people use chatbots before or after a doctor visit, the pressure is shifting from novelty to proof. (nature.com)
