Nature Medicine finds ChatGPT mistriages
- Nature Medicine published Mount Sinai’s first independent stress test of OpenAI’s ChatGPT Health, showing the January 2026 tool missed many simulated emergencies.
- In 960 responses across 60 clinician-written cases, ChatGPT Health undertriaged 52% of true emergencies and overtriaged 35% of nonurgent presentations.
- That matters because OpenAI said ChatGPT Health reached about 40 million daily users within weeks of launch.

Medical triage is the part where someone decides how fast you need care. That sounds simple, but it’s the whole game in emergencies. If a tool tells you “see someone in 24 to 48 hours” when you actually need an ER, that’s not a minor miss; it’s the dangerous kind. That’s why this new Nature Medicine paper landed so hard: Mount Sinai researchers put OpenAI’s ChatGPT Health through a structured stress test and found the model was least reliable at the exact moments where judgment matters most.

### What is ChatGPT Health?

ChatGPT Health is OpenAI’s consumer health feature, launched on January 7, 2026, to answer health questions and suggest how urgently someone should follow up with a clinician. The key point is that this is a public-facing product, not a hidden back-office tool for doctors. So a bad call can reach a patient directly, without a nurse or physician catching it first. (nature.com)

### What did the researchers actually test?

They built 60 clinician-authored case vignettes across 21 medical domains, then ran those cases under 16 different conditions, producing 960 total ChatGPT Health responses. This was essentially a stress test: not a real-world hospital trial, but a controlled way to see how the system behaves when symptoms get tricky, ambiguous, or socially nudged. (nature.com)

### Where did it fail?

The pattern was an inverted U. ChatGPT Health did better in the middle, on moderately urgent cases where the signal was clearer. But at the edges, it broke. The paper says the most dangerous failures clustered in emergency cases and nonurgent cases. In other words, it sometimes waved off serious problems and sometimes sounded the alarm on mild ones. (nature.com)

### How bad were the emergency misses?

Bad enough to be the headline. ChatGPT Health undertriaged 52% of the cases that the physician gold standard labeled emergencies.
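
For readers who want to see what a number like that 52% means mechanically, here is a minimal Python sketch of how undertriage and overtriage rates can be computed from pairs of gold-standard and tool recommendations. Everything in it is an assumption for illustration: the three-level `Urgency` scale, the function names, and the toy data are hypothetical and are not taken from the paper, whose actual categories and scoring may differ.

```python
from enum import IntEnum


class Urgency(IntEnum):
    """Hypothetical ordinal triage scale; the study's real categories may differ."""
    NONURGENT = 0    # self-care or routine follow-up
    WITHIN_48H = 1   # evaluation within 24 to 48 hours
    EMERGENCY = 2    # go to the emergency department now


def undertriage_rate(pairs, level=Urgency.EMERGENCY):
    """Share of cases with gold label `level` where the tool advised less urgent care."""
    relevant = [(gold, tool) for gold, tool in pairs if gold == level]
    if not relevant:
        return 0.0
    return sum(tool < gold for gold, tool in relevant) / len(relevant)


def overtriage_rate(pairs, level=Urgency.NONURGENT):
    """Share of cases with gold label `level` where the tool advised more urgent care."""
    relevant = [(gold, tool) for gold, tool in pairs if gold == level]
    if not relevant:
        return 0.0
    return sum(tool > gold for gold, tool in relevant) / len(relevant)


E, W, N = Urgency.EMERGENCY, Urgency.WITHIN_48H, Urgency.NONURGENT

# Made-up (gold standard, tool recommendation) pairs, for illustration only.
toy_pairs = [
    (E, E), (E, W), (E, E), (E, W),  # 2 of 4 emergencies sent to slower care
    (N, N), (N, W), (N, N),          # 1 of 3 nonurgent cases escalated
    (W, W),                          # mid-urgency case, counted in neither rate
]

print(f"undertriage among emergencies: {undertriage_rate(toy_pairs):.0%}")
print(f"overtriage among nonurgent cases: {overtriage_rate(toy_pairs):.0%}")
```

The design choice worth noticing is the denominator: each rate is computed only over cases with the matching gold label, which is why an emergency undertriage rate and a nonurgent overtriage rate can both be high at the same time.
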
The paper highlights diabetic ketoacidosis and impending respiratory failure as examples where the tool sometimes told people to seek evaluation within 24 to 48 hours instead of going to the emergency department. But it did recognize more obvious textbook emergencies like stroke and anaphylaxis. (nature.com) That split is important: it suggests the misses weren’t random so much as concentrated in edge cases that still look subtle to a language model.

### What about low-risk cases?

The other side of the problem was overtriage. For nonurgent presentations, the system frequently pushed users toward more urgent care than the gold standard called for. That is less dangerous than missing an emergency, but it still matters. It can send people into already crowded urgent-care clinics and ERs, and it trains users to treat the tool’s caution as noise. A triage system that cries wolf too often and misses wolves sometimes is not a stable front door to care. (nature.com)

### Did the prompts around the case matter?

Yes, and this part is unsettling. When family or friends in the vignette minimized the symptoms, the model’s recommendation shifted significantly in borderline cases, mostly toward less urgent care. That means the surrounding narrative could tug the answer away from the medical facts. Think of it like a triage nurse who starts listening to the anxious cousin instead of the pulse ox. (nature.com)

### Were there other safety problems?

Yes. The study also found inconsistent suicide-crisis safeguards. Crisis-intervention messages did not trigger reliably across suicidal-ideation scenarios and sometimes appeared in lower-risk cases while failing to appear when a specific self-harm method was described. That’s not a formatting bug; that’s a safety-layer problem. (nature.com)

### Why does this matter beyond one paper?

Because scale changes the risk. Mount Sinai notes that OpenAI said ChatGPT Health reached about 40 million daily users within weeks of launch.
At that size, even a small failure rate becomes a public-health issue. The bottom line is simple: consumer AI can be useful for health information, but triage is the hard part, and this paper says human review and prospective validation are not optional yet. (mountsinai.org) (nature.com)