ChatGPT Health Misses 52% of Emergencies

A new study found that ChatGPT Health failed to flag 52% of serious medical emergencies, under-triaging potentially life-threatening cases. Conversely, it overreacted to lower-risk scenarios, incorrectly flagging 64.8% of safe individuals as needing emergency care. These findings highlight the dangers of relying solely on AI for critical health decisions.

The study, published in *Nature Medicine*, was the first independent safety evaluation of the AI tool since its launch in January 2026. Researchers from the Icahn School of Medicine at Mount Sinai created 60 realistic patient scenarios across 21 medical specialties and tested them under 960 different variations to simulate real-world complexity.

While the AI performed well with "textbook" emergencies like stroke or severe allergic reactions, it struggled with more nuanced situations requiring clinical judgment. For instance, it failed to recommend emergency care in scenarios involving diabetic ketoacidosis and impending respiratory failure, instead suggesting evaluation within 24 to 48 hours. In one asthma case, the system noted signs of respiratory failure but still advised waiting rather than seeking immediate treatment.

A particularly concerning finding involved the tool's inconsistent activation of suicide crisis safeguards. The system was less likely to refer users to the 988 crisis hotline when they described a specific plan for self-harm than when they were less specific, an inversion of clinical risk assessment. Adding normal lab results to a scenario in which a user expressed suicidal thoughts caused the crisis intervention banner to disappear entirely.

The study also found that social context could significantly sway the AI's recommendations. When a scenario included a friend or family member downplaying the symptoms, the AI was nearly 12 times more likely to recommend a less urgent level of care. This points to a vulnerability to anchoring bias that could lead to dangerous delays in treatment.

In response to the findings, OpenAI acknowledged the need for continued research and improvement. Experts not involved with the study have described the results as a "wake-up call" and "unbelievably dangerous," emphasizing the need for independent, routine auditing of consumer-facing health AI tools before they are widely deployed.
