AI triage limits exposed

- Real-world ED pilots show foundation models struggle with sequential uncertainty in chest pain workups and live diagnostic reasoning.
- Benchmarks like MedQA do not capture the stepwise decision-making and missing-data problems common in busy EDs.
- Clinicians and experts are urging narrow, supervised tools with local validation and stricter governance before clinical deployment.

Chest-pain pilots are exposing a gap in medical artificial intelligence: models that ace exam-style questions can still stumble when emergency doctors need step-by-step decisions with missing information. (link.springer.com)

A February 24, 2026 study in *BMC Medical Informatics and Decision Making* tested GPT-4o on 500 real emergency-department chest-pain encounters drawn from a 202,632-visit cohort across three emergency departments. The model had to ask for information sequentially from 136 clinical variables, closer to a live workup than a static case vignette. (link.springer.com)

In that simulation, life-threatening causes made up 2.14% of chest-pain visits, and GPT-4o’s baseline prompt over-predicted rare emergencies, with 79.3% sensitivity but 45.2% specificity. Two prompts that added prevalence cues raised specificity to 83.0% and 94.7%, but sensitivity fell to 30.4% and 8.8%, respectively. (link.springer.com)

The same paper found the model’s information-seeking pattern barely matched an “optimal” query path derived from a Bayesian network, a statistical model that updates odds as new facts arrive. GPT-4o also asked for fewer vital signs and lab tests than clinicians did, while requesting more imaging data. (link.springer.com)

That mismatch cuts against the way medical artificial intelligence is often marketed. MedQA, one of the best-known benchmarks, is a multiple-choice board-exam dataset, and the English subset is described by its maintainers as “likely close to saturation,” with GPT-4 already scoring 86.1% in 2023. (ukgovernmentbeis.github.io)

A 2025 *NEJM AI* paper made the same point from another angle, arguing that licensing-exam scores miss how clinicians revise judgments as new evidence arrives. Its authors built a 750-question script-concordance benchmark specifically to test decisions under uncertainty, not just recall of a correct answer. (ai.nejm.org)

Emergency medicine is a hard place to hide that weakness. A 2025 review on acute chest pain said emergency departments are sorting conditions that range from muscle strain to acute coronary syndrome and aortic dissection, with guideline targets aiming for a missed acute coronary syndrome rate below 1%. (pmc.ncbi.nlm.nih.gov)

Some narrow tools are showing more practical gains when they stay inside a tighter lane. A multisite quality-improvement study published in *JAMA Internal Medicine* in 2024 reported that an artificial-intelligence triage system for chest-pain patients shortened time to electrocardiogram, troponin testing, and treatment-related steps in emergency departments. (jamanetwork.com)

Researchers and clinicians are drawing a line between that kind of bounded workflow tool and a general-purpose model asked to reason through a live diagnosis. Marzyeh Ghassemi, an MIT computer scientist, told GBH on April 22 that medical artificial intelligence can help only if the data and the deployment match the setting, and she warned that systems trained on the wrong populations can fail patients. (wgbh.org)

The result is a narrower message than the exam scores suggest: use local validation, supervision, and governance before putting a model near bedside decisions. In chest pain, the test is not whether a model can pick an answer from five choices, but whether it can ask for the right next fact before time runs out. (link.springer.com)
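
For readers who want to see what the reported sensitivity and specificity figures imply at a 2.14% prevalence, the arithmetic can be sketched in a few lines of Python. This is a back-of-envelope illustration, not a calculation from the study itself. It assumes the 2.14% prevalence applies to the encounters the model was scored on and that each prompt acts as a simple yes/no flag for a life-threatening cause. The function name `triage_counts` and the prompt labels are hypothetical.

```python
def triage_counts(sensitivity, specificity, prevalence, visits=1000):
    """Expected outcomes per `visits` encounters for a yes/no emergency flag."""
    cases = prevalence * visits            # true life-threatening cases
    non_cases = visits - cases             # everyone else
    true_pos = sensitivity * cases         # emergencies correctly flagged
    false_neg = cases - true_pos           # emergencies missed
    false_pos = (1 - specificity) * non_cases  # non-emergencies flagged anyway
    # Positive predictive value: share of flags that are real emergencies.
    ppv = true_pos / (true_pos + false_pos)
    return {"missed": false_neg, "false_alarms": false_pos, "ppv": ppv}


PREVALENCE = 0.0214  # 2.14% of chest-pain visits, as reported in the article

# (sensitivity, specificity) pairs as reported; the labels are hypothetical.
REPORTED = {
    "baseline prompt": (0.793, 0.452),
    "prevalence cue, first variant": (0.304, 0.830),
    "prevalence cue, second variant": (0.088, 0.947),
}

for label, (sens, spec) in REPORTED.items():
    r = triage_counts(sens, spec, PREVALENCE)
    print(
        f"{label}: missed {r['missed']:.1f}, "
        f"false alarms {r['false_alarms']:.1f}, PPV {r['ppv']:.1%} per 1,000 visits"
    )
```

Under those assumptions, the baseline prompt would flag roughly 536 non-emergencies per 1,000 visits while missing about 4 of the roughly 21 true emergencies. The strictest prompt would cut false alarms to about 52 but miss nearly 20. In all three cases, only about 3% to 4% of flagged patients would actually have a life-threatening cause.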
