AI model outperforms ER doctors

- Harvard Medical School and Beth Israel Deaconess researchers reported in Science on April 30 that OpenAI’s o1-preview beat attending physicians on real ER cases.
- On 76 emergency-department cases, o1-preview hit 67.1% exact-or-near diagnoses at triage, versus 55.3% and 50.0% for two physicians.
- The bigger shift is benchmark design: AI is now clearing physician-level text reasoning tests, so the next step is live clinical trials.

Emergency diagnosis is the hard version of medicine. You get incomplete information, the clock is running, and the wrong guess can send care down the wrong path fast. That is why this new Harvard-led result matters. In a paper published April 30 in *Science*, researchers from Harvard Medical School and Beth Israel Deaconess Medical Center showed that OpenAI’s o1-preview outperformed physicians on several clinical reasoning tasks, including real emergency-room cases.

### What exactly did they test?

This was not just a medical exam or a polished textbook vignette. The team ran six experiments covering published diagnostic cases, management decisions, and 76 real emergency-department cases from Beth Israel. For the ER portion, the model saw only the information available at each stage of care: first triage, then the ER physician stage, then admission to the floor or ICU. (hms.harvard.edu)

### Why is the ER result the headline?

Because triage is where uncertainty is highest. At that first touchpoint, there is the least data and the most pressure to make a good call. That is also where o1-preview opened the clearest gap: 67.1% exact-or-very-close diagnostic accuracy, versus 55.3% and 50.0% for two attending physicians. Basically, the model did best where the human job is messiest. (hms.harvard.edu)

### Was this some unfair AI-friendly setup?

Not really. The researchers say they used the same raw electronic health record information available to clinicians at the time and did not pre-process it into a cleaner benchmark. Two other attending physicians graded the outputs without knowing whether they came from humans or AI. That matters because it turns this from a demo into a real head-to-head reasoning test. (science.org)

### Which model was it?

The star here was OpenAI’s o1-preview, the reasoning model released in September 2024. The paper also compared it with earlier nonreasoning models like GPT-4 and GPT-4o.
Across the broader set of experiments, the reasoning model consistently beat prior-generation systems and often beat physician baselines too. So this is not just “AI got a little better.” It looks more like a step-change from fluent text generation to stronger structured reasoning. (hms.harvard.edu)

### Does this mean AI can replace ER doctors?

No, and that is the part people will oversimplify. The study was text-only. The model did not see the patient, notice body language, catch a smell in the room, or feel that something was off in the way experienced clinicians sometimes can. Emergency medicine is not just diagnosis generation. It is also teamwork, procedures, communication, risk management, and accountability. (science.org)

### So what does it mean?

It means the old way of evaluating medical AI is starting to break. For years, models were tested on multiple-choice questions and neat benchmark sets. But when systems start hitting the ceiling there, the real question becomes whether they can reason through messy cases as well as clinicians. This study argues that they now can, at least on text-based tasks, which is why the authors say prospective clinical trials are now urgent. (science.org)

### What happens next?

The obvious near-term use is not autonomous diagnosis. It is second opinions, triage support, and a backstop for missed possibilities, especially early, when information is sparse. Think of it less like a robot doctor and more like a tireless diagnostic sparring partner. But the catch is deployment: hospitals would need validation, workflow design, oversight, and proof that these gains hold up in live care, not just retrospective review. (hms.harvard.edu)

### Bottom line

The news is not that doctors are obsolete. It is that one of medicine’s hardest cognitive tasks, making a decent call from messy early evidence, now has an AI contender that can beat experienced humans in controlled testing.
That moves the conversation out of hype and into implementation. (hms.harvard.edu)
