AI beats ER doctors, 67% vs 50-55%

- Harvard Medical School and Beth Israel Deaconess researchers reported on April 30 that OpenAI’s o1-preview outperformed attending ER physicians on real triage cases.
- In 76 emergency-department cases, o1-preview hit 67.1% diagnostic accuracy, versus 55.3% and 50.0% for two blinded attending physicians.
- The result matters because it used messy real hospital records, but still excluded bedside cues and real-time patient interaction.

Emergency diagnosis is the hard mode of medicine. Patients show up with incomplete stories, noisy symptoms, and very little time. That is why this new result landed so hard: a Harvard- and Beth Israel-led team says OpenAI’s o1-preview beat experienced emergency physicians on real ER cases, at least in a controlled text-only test. The paper appeared in *Science* on April 30, and it is one of the clearest signs yet that reasoning-style AI is moving from toy demos into genuinely difficult clinical work.

### What actually got tested?

This was not a chatbot winging it on made-up symptoms. The team ran six experiments, including one built from 76 real cases from Beth Israel Deaconess Medical Center. They compared o1, older models like GPT-4o, and physicians across three diagnostic moments: initial ER triage, the ER physician stage, and admission to the floor or ICU.

### Where does the 67% number come from?

At the triage stage of those 76 real emergency cases, o1-preview reached 67.1% exact or near-exact diagnostic accuracy. The two attending physicians in the blinded comparison scored 55.3% and 50.0%. That gap is the headline because triage is where information is thinnest and uncertainty is highest.

Triage is the moment when nobody has the full picture yet. You have a few notes, maybe some vitals, maybe a messy history, and you still need to decide what could be going wrong. Basically, it is like trying to solve a puzzle with half the pieces flipped over. If an AI does well there, the promise is not just “better answers”; it is faster narrowing of possibilities and better next tests.

### Did the model only win on ER cases?

No, and that is part of why researchers are taking this seriously. In another benchmark using 143 *New England Journal of Medicine* clinicopathological conference cases, o1-preview got exact or very close diagnoses in 88.6% of cases, versus 72.9% for GPT-4. The model also did well on choosing diagnostic tests and management steps.

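The triage percentages quoted above for the 76 real cases come without raw counts. As a rough sanity check, and assuming each percentage is simply correct cases divided by 76 (the article does not state the counts, so they are inferred here), the sketch below recovers the implied number of correct diagnoses. The names `implied_count` and `reported` are illustrative only.

```python
# Sanity check on the reported triage accuracies.
# Assumption: accuracy = correct cases / 76. The raw counts are NOT
# given in the article; they are inferred from the percentages.
TOTAL_CASES = 76

# Percentages as reported in the article.
reported = {
    "o1-preview": 67.1,
    "attending physician A": 55.3,
    "attending physician B": 50.0,
}


def implied_count(pct: float, total: int) -> int:
    """Return the whole number of cases closest to pct percent of total."""
    return round(pct / 100 * total)


counts = {name: implied_count(pct, TOTAL_CASES) for name, pct in reported.items()}

for name, n in counts.items():
    # Recompute the percentage from the inferred count to confirm it
    # rounds back to the reported figure.
    recomputed = round(n / TOTAL_CASES * 100, 1)
    print(f"{name}: about {n}/{TOTAL_CASES} cases ({recomputed}%)")

# Gap between the model and each physician, in cases rather than points.
gaps = {
    name: counts["o1-preview"] - n
    for name, n in counts.items()
    if name != "o1-preview"
}
print(gaps)
```

Under that assumption, the headline gap works out to roughly 9 and 13 cases out of 76, a reminder that the comparison rests on a fairly small sample.
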
### So should hospitals replace doctors?

No. The test was text-only. The model did not see the patient, read body language, notice distress, smell ketones on the breath, or pick up the thousand tiny bedside cues clinicians use without even naming them. The researchers’ point is not “fire the ER staff.” It is that AI second opinions may now be good enough to deserve real prospective clinical trials.

### Why does this feel different from earlier AI-in-medicine stories?

Because older medical AI stories often lived on clean benchmarks or narrow imaging tasks. This one used messy real-world emergency-department documentation, the kind of fragmented record that usually breaks neat demos. That makes the result more believable, but also more uncomfortable, because it touches work many people assumed would stay stubbornly human for longer.

### What is the catch?

The catch is deployment. A model can be right in a retrospective comparison and still be hard to use safely in a live hospital. Doctors are accountable for edge cases, communication, consent, and follow-through. An AI can narrow the possibilities, but the clinical judgment stays with the doctor.

### Bottom line?

The big change is not that AI “beat doctors” in some abstract sense. It is that a reasoning model beat experienced ER physicians on a narrow but very real clinical task using actual hospital cases. That moves the conversation from hype to operations: where, exactly, you would trust a model, and where you absolutely would not.
