OpenAI o1 hits 67% triage accuracy

- Harvard Medical School and Beth Israel researchers published a Science study showing OpenAI’s o1 beat two attending physicians on 76 real ER triage cases.
- The sharpest number is 67.1% exact-or-near diagnostic accuracy at triage for o1, versus 55.3% and 50.0% for the doctors.
- It matters because the edge appeared in the messiest moment (early, text-only triage), but the study still stops short of bedside use.

Emergency diagnosis is one of the hardest versions of medicine. You get partial facts, noisy notes, time pressure, and a real chance that missing one detail hurts someone. That is why this new Harvard- and Beth Israel–led study lands so hard: OpenAI’s o1 reasoning model did better than two attending physicians on a set of real emergency department cases, especially at triage, when the chart is thinnest and the stakes are highest. The paper ran in *Science* on April 30, 2026, and it is less a victory lap than a warning shot: these systems are now good enough that hospitals and regulators have to decide what “safe use” actually means. (science.org)

### What did the researchers actually test?

They did not just hand the model polished exam questions. The team, led by researchers at Harvard Medical School with collaborators including Beth Israel Deaconess Medical Center, ran several experiments against physician baselines and prior models. The real-world piece used 76 randomly selected emergency department cases and compared the model with two internal medicine attending physicians at three points in care: initial ER triage, first physician evaluation, and hospital or ICU admission. (science.org)

### Why is triage the hard part?

Triage is the moment when almost everything important is still missing. You may have a chief complaint, a few symptoms, messy electronic record text, and not much else. That is exactly where o1 opened the biggest gap. In the *Science* commentary, the authors call out 67.1% exact-or-very-close diagnostic accuracy at initial triage for o1, versus 55.3% and 50.0% for the two physicians. That is not “slightly better on a benchmark.” That is better performance in the noisiest, most time-starved slice of the workflow. (science.org)

### Was this the final o1 model?

Not exactly. The main paper evaluated the o1 series, and the emergency-room headlines often point specifically to o1-preview, the earlier reasoning model OpenAI released in September 2024.
The supplement says the researchers used both o1-preview and o1 in different parts of the study, with fixed checkpoint versions and no […] o1 reasoning family, not some mystery hospital-only system. (science.org)

### Did the AI just memorize textbook cases?

That is the obvious objection, but the design here tried to avoid it. The researchers say they used authentic emergency department records and did not pre-process them into neat prompts. Reviewers who scored the outputs were blinded: they did not know whether a diagnosis came from […]ally, the model was operating on the same kind of ugly text humans see, not a cleaned-up school version of medicine. (science.org)

### So is AI ready to run the ER?

No, and that is the part people will overread. This was text-only reasoning. The model did not see the patient, read body language, inspect a rash, hear breathing, or integrate imaging and waveforms the way a real clinician does. The study authors and the accompanying *Science* commentary both push the same point: these res[…] autonomous deployment. (science.org)

### Why does this matter beyond one paper?

Because medicine has had a long history of AI looking smart on curated tests and then falling apart in real workflows. This study clears a higher bar. It compares against human physicians, uses real cases, and shows the strongest gains exactly where hospitals most need help: early sorting, differential diagnosis, a[…]e is not “AI doctor replaces doctor.” It is more like a second set of eyes that catches the cannot-miss diagnosis before the system does. (science.org)

### Bottom line

The headline is real: OpenAI’s o1 family beat two attending physicians on real ER triage diagnoses in a *Science* study. But the deeper story is narrower and more important: AI is starting to become genuinely useful in the messiest part of clinical reasoning, and now the hard problem shifts from capability to safe deployment. (science.org)
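
For a sense of scale, the triage percentages quoted in this piece can be turned back into approximate case counts. The sketch below is only arithmetic on the three reported figures and the 76-case sample. The per-rater counts are inferred by rounding and are not numbers reported by the study, and the labels `physician_a` and `physician_b` are hypothetical names for the two attending physicians.

```python
# Figures quoted in the article. Case counts are inferred here, not reported.
N_CASES = 76
REPORTED = {"o1": 67.1, "physician_a": 55.3, "physician_b": 50.0}  # percent


def implied_correct(pct: float, n: int = N_CASES) -> int:
    """Nearest whole number of cases consistent with a reported percentage."""
    return round(pct / 100 * n)


def pct_from_count(k: int, n: int = N_CASES) -> float:
    """Percentage (one decimal place) that k correct cases out of n gives."""
    return round(100 * k / n, 1)


# Inferred count of exact-or-near diagnoses per rater.
counts = {name: implied_correct(pct) for name, pct in REPORTED.items()}

# Percentage-point gap between o1 and each physician.
gaps = {
    name: round(REPORTED["o1"] - pct, 1)
    for name, pct in REPORTED.items()
    if name != "o1"
}

for name, k in counts.items():
    print(f"{name}: about {k}/{N_CASES} cases ({pct_from_count(k)}%)")
for name, gap in gaps.items():
    print(f"o1 lead over {name}: {gap} percentage points")
```

Under these assumptions, each inferred count rounds back to the reported percentage, and the gaps of roughly 12 and 17 percentage points correspond to only about 9 and 13 cases out of 76. That small sample is one reason to read the result as a signal and not a deployment verdict.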
