NPR: AI outperformed ER doctors
- Harvard and Beth Israel researchers said an OpenAI reasoning model beat emergency physicians on text-based diagnosis and care decisions in a Science study.
- In 76 real Boston ER cases, o1-preview got 67.1% right at triage, versus 55.3% and 50.0% for two attending doctors.
- The result matters, but the catch is big: this was retrospective chart reasoning, not autonomous bedside care. (science.org)

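
For a sense of scale, the triage percentages quoted above can be turned back into approximate case counts. This is only a back-of-the-envelope sketch: the counts are inferred by rounding the reported percentages against 76 cases, they are not figures taken from the study, and the labels "attending physician A" and "attending physician B" are placeholders.

```python
# Back-of-the-envelope check: convert the reported triage accuracies
# into implied case counts. Counts are inferred, not reported by the study.
N_CASES = 76  # ER cases in the triage comparison, as quoted in the article

reported_accuracy = {
    "o1-preview": 67.1,
    "attending physician A": 55.3,  # placeholder label
    "attending physician B": 50.0,  # placeholder label
}


def implied_correct(pct: float, n: int = N_CASES) -> int:
    """Nearest whole number of correct cases implied by a percentage."""
    return round(pct / 100 * n)


implied = {name: implied_correct(pct) for name, pct in reported_accuracy.items()}

for name, k in implied.items():
    print(f"{name}: ~{k}/{N_CASES} correct ({k / N_CASES:.1%})")

model_correct = implied["o1-preview"]
for name, k in implied.items():
    if name != "o1-preview":
        gap_points = reported_accuracy["o1-preview"] - reported_accuracy[name]
        print(f"gap vs {name}: {gap_points:.1f} points, ~{model_correct - k} cases")
```

Under these assumptions the gap works out to roughly 9 and 13 additional cases out of 76.
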
Emergency medicine is messy. Patients show up with partial stories, bad data, and symptoms that point in five directions at once. That’s why this new result lands so hard: a Harvard and Beth Israel team says an OpenAI reasoning model beat ER doctors on diagnosis and management tasks using real emergency-department cases, with the paper published April 30 in *Science*. But the news is not “AI ca[…] crossed a threshold people have been waiting on for decades. (science.org)

### What did the model actually beat doctors at?

Not bedside medicine. Not physical exams. Not talking to scared patients. The study tested whether a model could read the kind of messy chart material doctors use (triage notes, vitals, lab data, evolving records) and then produce a differential diagnosis, suggest tests, and make care-management calls. In the real-world ER slice of the study, researchers compared the model with attending physicians at multiple decision points. (science.org)

### How much better was it?

The headline number is triage. On those 76 ER cases, OpenAI’s o1-preview hit 67.1% diagnostic accuracy, while two attending physicians scored 55.3% and 50.0%. That is a real gap, not statistical dust. The broader paper also says the model outperformed earlier non-reasoning systems like GPT-4 across several clinical reasoning benchmarks, including hard clinicopathologic cases that have been used for decades as a kind of gold-standard stress test. (letsdatascience.com)

### Why is triage the hard part?

Because triage is the fog-of-war version of medicine. You have the least information right when the stakes are highest. A patient might have chest pain, but that can mean reflux, anxiety, pneumonia, or a heart attack. The doctor has to build a ranked list of possibilities before the picture is clear. That is exactly where reasoning […] as new clues arrive. (science.org)

### So is this “doctor replacement” territory?

No, and even the people pushing the work are saying that. The study used historical and simulated cases, not live patient care. A model can look strong when the task is “read the chart and reason,” but real medicine also includes noticing a patient’s breathing, catching a weird smell, seeing that someone is deteriorating, and deciding when the chart is wrong. […] mistaken for proof of safety or efficacy in treating real patients. (statnews.com)

### Then what’s the practical use?

Second opinions, basically. A model that is good at structured differential diagnosis could act like an always-on backstop: surfacing possibilities a rushed clinician might miss, suggesting tests, and organizing evidence. That is also where Google DeepMind is aiming with its newly announced “AI co-clinician” work: not autonomous care, but AI operating under physician authority, with doctors retaining judgment and control. (deepmind.google)

### Why are companies emphasizing augmentation?

Because medicine is a liability-heavy, trust-heavy field. If an AI hallucinates in a chatbot, that’s annoying. If it hallucinates in an ER, that can kill someone. DeepMind’s framing is telling: it talks about “triadic care,” where AI helps patients and clinicians but stays under expert supervision. In its own primary-care evaluations, the company highlighted evidence synthesis and low critical-error rates, not autonomy. (deepmind.google)

### What changes now?

The next fight is not benchmark bragging rights. It’s prospective clinical trials. Researchers are basically saying the models are now good enough that hospitals should test them in controlled workflows and see whether they actually reduce missed diagnoses, improve outcomes, or just create new failure modes. That is a much harder bar, but it’s the only one that matters. (harvardmagazine.[…]gnosis-harvard-study)

### Bottom line?

This looks like a real step forward in clinical reasoning, not a stunt.
But the useful version of the future is not an AI doctor replacing the ER team. It’s a doctor with a very strong machine second opinion — and a lot of guardrails. (science.org)