OpenAI o1 triage hits 67% accuracy

- Harvard and Beth Israel researchers reported in Science that OpenAI’s o1-preview beat two attending physicians on 76 real ER cases.
- At initial triage, o1 hit 67.1% exact-or-very-close diagnostic accuracy, versus 55.3% and 50.0% for the two doctors.
- The result matters because the study used messy, real clinical records, but the paper calls for prospective trials, not AI-only care.

A medical triage result like this lands hard because triage is the messy part of medicine. There’s not enough time, not enough data, and often not enough clarity. That’s why the new Harvard-led paper matters: it didn’t test an AI on neat exam questions, but on real emergency department cases from Beth Israel Deaconess Medical Center. In that setup, OpenAI’s o1-preview beat two attending physicians at the earliest triage stage, hitting 67.1% exact-or-very-close diagnostic accuracy versus 55.3% and 50.0%. (science.org)

### What was actually tested?

This was a Science paper from a team at Harvard Medical School, Beth Israel Deaconess, and collaborators including Stanford. The researchers ran six experiments in total, but the one getting attention used 76 actual emergency department cases and compared AI, prior models, and physicians at three d[...]tal admission. (science.org)

### Why is triage the hard part?

Because triage is medicine with half the puzzle missing. A patient arrives with a few symptoms, partial history, scattered chart notes, and maybe misleading details. The paper says the biggest gap showed up exactly there: the first touchpoint, where urgency is highest and information is thinnest. [...] curated cases, but fell apart once the data got noisy. (science.org)

### Did the AI just memorize textbook cases?

Not in the part people care about most. The ER study used real patient records from a major academic medical center, and the team says the models got the same text-based information available in the electronic record at each moment, without cleaning it up first. So this was closer to a second-opinion simulation than a classroom quiz. (science.org)

### How big was the gap?

Big enough to matter, but not big enough to declare victory. At initial triage, o1 reached 67.1%. The two attending physicians scored 55.3% and 50.0%.
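
To put those initial-triage percentages on a common footing, here is a minimal Python sketch of the arithmetic. It assumes each reported percentage is a share of the same 76 cases. The case counts it prints are implied by that assumption, not quoted from the paper, and the `physician_1` / `physician_2` labels are placeholders.

```python
N_CASES = 76  # ER cases in the experiment, as reported in the article

# Reported initial-triage accuracy in percent.
# The physician labels are hypothetical placeholders.
reported = {"o1-preview": 67.1, "physician_1": 55.3, "physician_2": 50.0}


def implied_correct(pct: float, n: int = N_CASES) -> int:
    """Nearest whole number of cases consistent with a reported percentage."""
    return round(pct / 100 * n)


# Case counts implied by the percentages (an inference, not a quoted figure).
counts = {name: implied_correct(pct) for name, pct in reported.items()}

# Gap between the model and each physician, in percentage points.
gaps = {
    name: round(reported["o1-preview"] - pct, 1)
    for name, pct in reported.items()
    if name != "o1-preview"
}

for name, k in counts.items():
    print(f"{name}: about {k}/{N_CASES} cases ({k / N_CASES:.1%})")
for name, gap in gaps.items():
    print(f"o1-preview vs {name}: {gap} percentage points")
```

Under that assumption, the gaps of 11.8 and 17.1 percentage points correspond to roughly 9 and 13 cases out of 76. A sample that small is one reason to read the result as a signal worth testing further, not as a settled verdict.
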
The commentary in Science says o1 matched or exceeded physician-level performance across the broader set of authentic diagnostic tasks they tested, and i[...]ases. (science.org)

### Does that mean AI is ready to run the ER?

No, and that’s the catch. This was retrospective, text-only, and limited. The model wasn’t examining patients, reading facial expressions, noticing who looks sick across the room, or integrating imaging and other nontext signals the way clinicians do in practice. The paper and the [...]side clinicians, inside real workflows, with safety monitoring. (science.org)

### Why could a model do well here?

Basically, reasoning models are built to work step by step instead of jumping straight to an answer. In text-heavy medicine, that can help with differential diagnosis: sorting signal from distraction and keeping multiple possibilities alive. The authors frame this as a jump from benchmark sa[...]el may now be good enough that medicine has to test it seriously, not just admire demo scores. (science.org)

### What changes now?

Hospitals probably won’t hand triage over to AI. But they may start looking harder at AI as a structured second opinion, especially early in the encounter, when uncertainty is highest. That could mean pilots for diagnostic support, escalation prompts, or test-planning assistance. The real fight now shifts [...] error?” (science.org)

### Bottom line?

The news is not that AI replaced ER doctors. It’s that a reasoning model cleared a benchmark medicine actually respects: messy, real-world cases with human baselines. That moves the conversation out of hype and into workflow, oversight, and trials, which is where serious clinical technology either becomes useful or falls apart. (science.org)
