Harvard trial: AI triage beats doctors
- Harvard Medical School and Beth Israel Deaconess researchers said on April 30 that their Science study showed OpenAI’s o1-preview beat physicians on real ER triage tasks.
- In 76 Boston emergency cases, o1-preview hit 67.1% exact-or-very-close diagnostic accuracy at triage, versus 55.3% and 50.0% for two attendings.
- The result matters because triage is early, messy medicine, but this was retrospective testing, not autonomous live patient care.

Emergency-room triage is the hard mode of medicine. You get a thin slice of information, almost no time, and a real chance that the sickest person looks deceptively fine. That is why this Harvard-led result landed so hard: in a new Science paper published April 30, researchers said OpenAI’s o1-preview beat physician baselines on several clinical reasoning tasks, including the earliest stage of ER decision-making. But the catch is important: this was a study on real cases reconstructed from records, not an AI running the front desk of a live hospital. (hms.harvard.edu)

### What actually got tested?

The team from Harvard Medical School and Beth Israel Deaconess Medical Center compared a large language model with physicians across multiple kinds of clinical reasoning: published case challenges, management decisions, and real emergency-department cases. In the ER portion, the model saw […]ons. (hms.harvard.edu) That setup matters because it tries to mimic the way information arrives in actual care: gradually, and with lots missing early on. (hms.harvard.edu)

### Why is triage the big deal here?

Because triage is where uncertainty is highest. A doctor or nurse is not solving the whole case yet. They are making a fast call about likely diagnoses, urgency, and next steps with incomplete clues. Turns out that is where o1-preview looked strongest. In the 76 emergency-room cases, o1-preview reached 67.1% exact-or-very-close diagnostic accuracy at triage, compared with two attending physicians at 55.3% and 50.0%. (hms.harvard.edu) (harvardmagazine.com)

### Was this just a cherry-picked demo?

Not really, at least not in the obvious sense. The paper used several different benchmarks and compared the model against hundreds of clinicians in the broader study. The emergency-room arm used 76 real cases, and blinded physician reviewers judged the outputs without knowing whether they came from the AI or f[…]ages. (harvardmagazine.com) (hms.harvard.edu)

### So did AI “beat doctors”?

In this study, on these tasks, yes. But that sentence needs guardrails.
The model outperformed physician baselines in retrospective evaluations of reasoning from chart data. That is different from safely handling a live patient who is confused, unstable, nonverbal, or giving incomplete history. The model was also text-only here: no visual exam, no bedside intuition, no family dynamics, no hallway chaos. (hms.harvard.edu)

### Why might AI be good at this?

One reason is breadth. A reasoning model can chew through differential diagnoses without getting tired or anchored on the first plausible story. The researchers also said the model did especially well on rare diseases and complex published cases that are famous for distracting even experienced clinicians. Basically, it may be unusually good at keeping many possibilities alive at once. (harvardmagazine.com)

### What is the real limitation?

Prospective evidence. The authors are not saying hospitals should replace triage clinicians with a chatbot. They are saying systems this capable now deserve the kind of controlled clinical trials used for other serious medical interventions. That is a big shift in framing, from “interesting benchmark toy” to “tool that might be ready for supervised real-world testing.” (hms.harvard.edu)

### What happens next?

The obvious next step is human-plus-AI, not human-versus-AI. If a model can surface overlooked diagnoses or flag a patient who sounds safer than they are, it could help in exactly the part of emergency care where mistakes are most expensive. But hospitals will need proof on workflow, safety, bias, liability, and whether clinicians actually perform better with the tool than without it. (hms.harvard.edu)

### Bottom line?

This was not an AI replacing ER doctors. It was a serious sign that on one of medicine’s messiest cognitive tasks, a frontier model can already outperform expert humans in retrospective testing. That moves the conversation out of hype and into trial design, which is where it belongs. (hms.harvard.edu)
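
For a sense of scale, the triage percentages quoted earlier can be turned back into approximate case counts. This is only a back-of-the-envelope sketch: it assumes each percentage is a simple share of all 76 emergency cases, and the rater labels and function name are placeholders invented for illustration, not identifiers from the study.

```python
# Back-of-the-envelope check: convert the reported triage accuracies into
# approximate case counts. ASSUMPTION: every percentage is a share of the
# same 76 cases; the article does not spell out the denominators.
TOTAL_CASES = 76

# Accuracy figures as quoted in the article. The keys are placeholder
# labels for illustration, not names used by the researchers.
reported_accuracy = {
    "o1-preview": 0.671,
    "attending_1": 0.553,
    "attending_2": 0.500,
}


def approx_case_counts(accuracy_by_rater, total):
    """Round each accuracy share to the nearest whole number of cases."""
    return {
        rater: round(share * total)
        for rater, share in accuracy_by_rater.items()
    }


counts = approx_case_counts(reported_accuracy, TOTAL_CASES)
model_cases = counts["o1-preview"]

for rater, cases in counts.items():
    lead = model_cases - cases  # how many more cases the model got right
    print(f"{rater}: ~{cases}/{TOTAL_CASES} cases (model lead: {lead})")
```

Under those assumptions, the model's lead works out to roughly 9 and 13 cases out of 76. That is a small sample, which fits the authors' call for controlled clinical trials.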