OpenAI model beat ER doctors in study

- An OpenAI reasoning model outperformed emergency‑room physicians on diagnostic and clinical‑reasoning tasks in a real‑world study published in Science.
- NPR and STAT reported the Harvard Medical School/Beth Israel Deaconess‑linked trial showed superior diagnostic accuracy and management decisions for the model.
- Authors and commentators caution the model is a tool for second opinions, not a replacement for clinicians, per NPR and STAT. (npr.org) (statnews.com)

The news here is not “AI passed another medical exam.” It’s more specific than that. A Harvard Medical School and Beth Israel Deaconess team put OpenAI’s o1-preview reasoning model through the kind of messy, text-heavy clinical tasks doctors actually do: reading incomplete charts, forming a differential, deciding what to test next, and making triage calls in real emergency-department cases. In a paper published April 30 in *Science*, the model often beat physician baselines, including on 76 real ER cases from Boston.

### What exactly did the model do?

It did clinical reasoning tasks. That means more than naming a disease. The model had to look at symptoms, labs, imaging notes, and chart fragments, then suggest likely diagnoses and next management steps. The study also tested whether the model could handle different moments in care, from early triage, when information is sparse, to later admission decisions, when the picture is fuller.

### Why is that different from old medical-AI demos?

Because a lot of earlier wins were on cleaner benchmarks. Multiple-choice exams are useful, but they flatten the hard part of medicine: uncertainty. Real charts are messy. Important facts are buried. Some clues point the wrong way. The Harvard team’s point was basically that older tests were hitting the ceiling, so they built harder ones that look more like actual clinical work.

### How strong was the result?

Strong enough to get people’s attention. On the real ER cases, o1-preview was judged to match or exceed expert attending physicians across the three decision points the researchers tested, and it was especially strong at the earliest triage stage, when the least information was available. On 143 *New England Journal of Medicine* clinicopathological conference cases, the model was also evaluated for differential diagnosis and testing plans, using the same setup across cases from 2021 through September 2024.

### Why would AI be good at this?

Because this is a pattern-recognition-and-synthesis problem with lots of text. Reasoning models are built to spend more compute working through a problem before answering. In medicine, that can matter a lot. The hardest part is often not choosing between two obvious diagnoses; it’s remembering the third possibility nobody thought of. That’s where these systems seem to be getting unusually good.

### So should patients trust a bot over a doctor?

No, and that’s the catch. These were retrospective and simulated evaluations, not live deployment studies where the AI was actually directing care for patients in real time. The model saw chart data, not the room. It could not examine a patient, notice distress, weigh preferences, or take responsibility for a risky call. The researchers themselves argue that the next step is controlled prospective clinical trials, not replacing clinicians.

### What are doctors still doing that the model isn’t?

A lot. Medicine is not just diagnostic ranking. Doctors talk to patients, catch contradictions, manage fear, explain tradeoffs, and make ethical judgments under uncertainty. One outside researcher’s pushback was useful here: “clinical reasoning” in this setup is not the same thing as moral reasoning or bedside judgment. The model may be better at some narrow cognitive tasks without being a better doctor in the full human sense.

### Why does this matter now?

Because it shifts the argument. The question is less “can AI pass medical tests?” and more “where in workflow should it sit?” A second-opinion tool in triage, diagnosis support for rare cases, or a chart-review assistant all look more plausible after this paper. But each use case carries different risks, and the wrong rollout could create overtrust fast.

### Bottom line?

This looks like a real capability jump. An OpenAI reasoning model beat doctors on several hard clinical reasoning tasks, including real ER cases. But the result is best read as a warning and an opportunity at once: medicine may have a very strong new copilot, not a replacement pilot.
