Harvard-led study: OpenAI’s o1-preview beats ER physicians on triage accuracy
- Harvard Medical School and Beth Israel Deaconess researchers said on April 30 that OpenAI’s o1-preview beat two attending physicians on real ER triage cases.
- In 76 Boston emergency cases, o1-preview hit 67.1% exact-or-close triage accuracy, versus 55.3% and 50.0% for the two physicians.
- The result matters because triage is the noisiest moment in care, and the team now wants prospective clinical trials.

Emergency-room triage is the hardest version of diagnosis. You have very little information, the clock is loud, and the cost of missing something serious is huge. That is why this Harvard-led result landed so hard: in a study published April 30, researchers found that OpenAI’s o1-preview outperformed two attending physicians on real Beth Israel Deaconess emergency cases, especially at the very first triage step. The point is not that a chatbot should run the ER. The point is that AI may now be good enough to test as a real clinical second opinion. (hms.harvard.edu)

### What actually got tested?

This was not a medical-exam stunt. The team from Harvard Medical School, Beth Israel Deaconess, and collaborators including Stanford compared an AI reasoning model with physicians on several clinical reasoning tasks, then ran a real-world emergency-department test [...]: no cleaned-up summaries, no extra hints. (hms.harvard.edu)

### Why is triage the hard part?

Triage happens before the picture is clear. You may have vital signs, a short note, a few demographics, and not much else. That is exactly when humans are most vulnerable to noise, overload, and anchoring on the wrong early clue. The study’s striking result is that the AI’s edge was biggest there: at the first touchpoint, when uncertainty is highest. (techcrunch.com)

### How big was the gap?

At initial ER triage, o1-preview reached 67.1% exact-or-very-close diagnostic accuracy. The two attending physicians scored 55.3% and 50.0%. The model also matched or exceeded expert performance at later stages, including first physician contact and admission decisions, but the triage gap is the number people will remember because it is the scariest moment to be wrong. (techcrunch.com)

### Does that mean the AI is “better than doctors”?

Not in the simple headline sense. The study tested text-based reasoning on retrospective cases.
The AI did not examine patients, notice body language, feel an abdomen, or manage a chaotic room with interruptions. Basically, it showed stronger d[...]place the doctor.” (techcrunch.com)

### Why are researchers treating this as a turning point?

Because old benchmarks are getting too easy. Harvard’s team argued that multiple-choice style medical tests no longer tell you much when frontier models are already near the ceiling. Real clinical work is messy. This study pushed evaluation closer to that mess (published case conferences, reasoning tasks, and actual emergency records), and the model still held up. (hms.harvard.edu)

### What is the catch?

Safety and workflow. A strong retrospective score does not tell you how clinicians will use the tool under pressure, whether it creates overtrust, or whether it helps some patient groups more than others. It also focused on text inputs, while real emergency care include[...] (hms.harvard.edu) (techcrunch.com)

### So what happens next?

The researchers are calling for prospective clinical trials, meaning studies where the tool is tested in live care settings with guardrails, not just graded afterward on old cases. That is the right next step. Medicine has had plenty of flashy demos. What matters now is whether an AI second opinion can reduce misses, speed decisions, and help clinicians without creating new failure modes. (hms.harvard.edu)

### Bottom line?

This looks like a real milestone in medical AI, not just another benchmark win. But the news is not “the ER is automated now.” The news is that a Harvard-led team found enough signal in real emergency cases to justify testing AI beside clinicians, and that is a much bigger deal than another perfect score on an exam. (hms.harvard.edu)
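
### How do the numbers work out?

For readers who want to check the arithmetic, here is a rough back-of-envelope sketch. It is not part of the study. It assumes the 76-case figure is the denominator for all three triage percentages (the raw counts are not given above, so the counts it prints are back-calculated, not reported), and the labels `physician A` and `physician B` are placeholders.

```python
# Back-of-envelope check of the reported triage figures.
# Assumption: all three percentages share the 76-case denominator.
CASES = 76

# Reported exact-or-close triage accuracy, in percent.
# "physician A" / "physician B" are placeholder labels, not names from the study.
reported = {
    "o1-preview": 67.1,
    "physician A": 55.3,
    "physician B": 50.0,
}


def implied_count(pct: float, n: int = CASES) -> int:
    """Nearest whole number of cases that would yield the reported percentage."""
    return round(pct / 100 * n)


counts = {name: implied_count(pct) for name, pct in reported.items()}

# Show that each back-calculated count rounds back to the reported percentage.
for name, pct in reported.items():
    k = counts[name]
    print(f"{name}: {k}/{CASES} = {100 * k / CASES:.1f}% (reported {pct}%)")

# Express the gap both in percentage points and in implied cases.
for name in ("physician A", "physician B"):
    gap_points = reported["o1-preview"] - reported[name]
    gap_cases = counts["o1-preview"] - counts[name]
    print(f"gap vs {name}: {gap_points:.1f} points, about {gap_cases} cases")
```

Under that assumption, the percentages correspond to roughly 51, 42, and 38 of 76 cases, so the 11.8- and 17.1-point gaps come down to about 9 and 13 cases. With a sample this size, each single case shifts accuracy by about 1.3 points.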