OpenAI’s o1-preview outperforms ER doctors in Harvard-led study
- Harvard Medical School and Beth Israel Deaconess researchers said on April 30 that OpenAI’s o1-preview beat physicians on real ER triage and diagnosis tasks.
- In 76 Boston emergency cases, o1-preview hit 67.1% exact-or-very-close diagnostic accuracy at triage, versus 55.3% and 50.0% for two attendings.
- The bigger shift is evaluation itself: old multiple-choice medical AI tests are now too easy for frontier models.
Emergency-room triage is the hard version of medical diagnosis. The chart is messy. The patient is new. Time is short. And the doctor often has to act before the full story exists. That is why this new Harvard-led result matters: not because an AI aced another exam, but because it did better than physicians on text-based reasoning tasks that look a lot more like actual clinical work. The study, published April 30 in *Science*, came from Harvard Medical School and Beth Israel Deaconess Medical Center, with collaborators including Stanford, and tested OpenAI’s o1-preview model across real emergency cases and classic diagnostic benchmarks. (hms.harvard.edu)

### What did the model actually do?

It did not examine patients, read scans directly, or walk around a hospital. It got the same kind of text information a clinician might have at a given moment (symptoms, notes, lab snippets, chart history) and then had to suggest likely diagnoses and what to [...] of care. (hms.harvard.edu)

### Why is the ER result the headline?

Because the emergency department is where uncertainty is highest. In one experiment, the team used records from 76 real Boston emergency cases and stopped the clock at three points: arrival triage, first physician contact, and later admission decisions. Rev[...] exceeded the doctors across the stages, with its strongest edge showing up early, when the least information was available. (harvardmagazine.com)

### How big was the gap?

At initial triage, o1-preview reached 67.1% exact-or-very-close diagnostic accuracy. The two attending physicians scored 55.3% and 50.0% on the same cases. On published New England Journal of Medicine clinicopathological conference cases (the famously thorny ones used for decades as a benchmark) reporting a [...] a tiny edge on a toy test. (theoutpost.ai)

### Why does this feel different from older “AI beats doctors” stories?

Because older medical-AI wins often came from narrow tasks or multiple-choice style exams. The authors are explicit that those benchmarks are now hitting their ceiling. Peter Brodeur, a co-first author, said models [...]ed reasoning tasks with historical physician baselines and real chart data. (hms.harvard.edu)

### Does this mean AI is ready to replace ER doctors?

No, and that is the part people will overread if they are not careful. These were retrospective, text-based evaluations, not live patient care. The model did not build trust with a patient, notice body language, catch a bad blood-pressure cu[...]e trials, not as a green light to hand medicine over to chatbots. (statnews.com)

### So where could this matter first?

Front-door workflows. Triage support. Differential-diagnosis generation. Maybe routing the right patient to the right level of care faster, or helping a clinician not miss the weird diagnosis hiding inside a noisy chart. That is especially relevant in settings where doct[...]t AI alone. (harvardmagazine.com)

### What is the real takeaway?

The news is not that doctors are obsolete. It is that frontier reasoning models have crossed into a level of clinical performance that older evaluation methods can no longer capture. Medicine now has a more uncomfortable question: not whether these systems are impressive, but where they help, where they fail, and how to test them before patients pay the price. (hms.harvard.edu)