Harvard trial: AI beats ER doctors

- Harvard Medical School and Beth Israel researchers reported in Science that OpenAI’s o1 reasoning model beat attending ER doctors on text-based triage diagnoses.
- In 76 real emergency cases, o1 hit 67.1% exact-or-close diagnostic accuracy at triage, versus 55.3% and 50.0% for two physicians.
- The result sharpens pressure to test AI beside clinicians, not just on exams, as staffing shortages keep pushing automation forward.

Emergency-room triage is the hard version of diagnosis. You get partial information, noisy notes, and almost no time. That is why this Harvard-led result landed so hard: a reasoning model from OpenAI beat experienced physicians on text-based emergency cases pulled from real hospital care, and the paper ran in Science on April 30. But the news is not “AI can replace doctors.” The real story is that AI is getting unusually good at one narrow, high-value slice of medicine: clinical reasoning from charts. (science.org)

### What actually got tested?

The team from Harvard Medical School and Beth Israel Deaconess Medical Center did not hand the model a medical licensing exam and call it a day. They ran six experiments built around physician work: differential diagnosis, test selection, management reasoning, and one especially important real-world set of 76 emergency-department cases evaluat[...] (science.org) [...] a newer o1 model used in part of the emergency-case comparison. Both were text-only: no scans, no bedside exam, no live patient interaction. (science.org)

### Why is triage the interesting part?

Because triage is where uncertainty is highest. A clinician is trying to decide, from scraps, what might kill this patient first and what to do next. In the Science commentary, the standout number came from that first touchpoint: on real emergency cases, o1 reached 67.1% exact-or-very-close diagnostic accuracy at initial triage, ahead [...] (science.org) [...]s could not reliably tell the AI’s writeups from the humans’. That is a bigger deal than another benchmark win, because it is closer to the messy work doctors actually do. (science.org)

### Was it only better at one dataset?

No, and that is part of why people are taking it seriously. In the supplementary material, the researchers say they also tested 143 New England Journal of Medicine clinicopathological conference cases from 2021 through September 2024. On those cases, the model generated differential diagnoses and suggested next tests, which attending [...] (science.org) [...] very-close diagnostic accuracy in 88.6% of those published cases, versus 72.9% for GPT-4. So this was not just “AI got lucky on one ER batch.” It beat the prior generation too. (science.org)

### So why not just put it in charge?

Because text reasoning is not the same thing as medicine. The model never touched a patient. It did not notice a rash, smell ketones, see someone deteriorate, or deal with a lying history, bad vitals, or a broken workflow. Even the researchers seem worried people will overread the result. The Science commentary frames the paper as a si[...] (science.org) [...] basically, the benchmark phase is ending and the safety phase has to start. (science.org)

### What makes this different from old “AI doctor” hype?

Two things. First, this is reasoning, not just pattern matching on a multiple-choice test. Second, the comparison was against physician performance on authentic tasks, including real emergency cases. That gets closer to the old challenge in medicine: show a system can actually reason through diagnosis better than hum[...] (science.org) [...]d the paper as finally meeting a standard people had talked about for decades. (statnews.com)

### Where does this go next?

Probably into copilots before autopilots. The obvious use is not an AI replacing the ER doctor. It is an AI generating a differential, flagging what cannot be missed, and suggesting tests or disposition options while the clinician stays accountable. That also fits the broader pressure on hospit[...] (statnews.com) [...] move into care, regulators and hospitals will need evidence from live prospective trials, not just retrospective chart tests. (science.org)

### Bottom line?

The breakthrough is real, but narrower than the headlines make it sound. AI just cleared an important bar in diagnostic reasoning, especially in early emergency triage, and that means medicine has moved from “can it think through cases?” to “how do we use this without making care less safe?” (science.org)
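
For readers who want a feel for how much 76 cases can carry, here is a back-of-the-envelope sketch. It converts the reported triage percentages into whole-case counts and puts a rough 95% Wilson interval around each one. The case counts are an assumption inferred by rounding the percentages against 76 cases, the labels "physician A" and "physician B" are placeholders, and the interval is a generic textbook method, not the analysis used in the paper.

```python
import math

N_CASES = 76  # emergency-department cases, as reported in the article

# Reported triage accuracies from the article. The labels are placeholders.
REPORTED = {"o1": 0.671, "physician A": 0.553, "physician B": 0.500}


def implied_counts(reported: dict[str, float], n: int) -> dict[str, int]:
    """Infer whole-case counts from rounded percentages (an assumption)."""
    return {name: round(acc * n) for name, acc in reported.items()}


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


if __name__ == "__main__":
    for name, correct in implied_counts(REPORTED, N_CASES).items():
        low, high = wilson_interval(correct, N_CASES)
        print(f"{name}: {correct}/{N_CASES} correct, "
              f"rough 95% interval {low:.1%} to {high:.1%}")
```

With only 76 cases, each interval spans roughly ten percentage points on either side of the point estimate. Because the model and the physicians were scored on the same cases, independent intervals like these are only a rough gauge and a paired comparison would be the more appropriate test. It is one reason the article's call for live prospective trials matters.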
