DeepSeek‑R1 hits 60% vs residents 27%

- A June 2025 Critical Care study tested DeepSeek‑R1 on 48 hard ICU cases and found the model beat unassisted critical care residents on top diagnoses.
- DeepSeek‑R1 got 60% top-diagnosis accuracy, while residents without AI got 27%; residents with AI support rose to 58% and worked about twice as fast.
- That matters because the bottleneck is shifting from raw model skill to safe deployment, oversight, and liability inside real clinical workflows.

Diagnostic AI is getting past the “cute demo” stage. A June 2025 study in *Critical Care* put DeepSeek‑R1 on 48 difficult critical illness cases and compared it with critical care residents. The headline number is real: the model got the top diagnosis right 60% of the time, while residents without AI support got 27%. But the more important detail is that residents using the model climbed to 58% and finished much faster, which makes this look less like “AI replaces doctors” and more like “AI becomes a very strong second set of eyes.” (link.springer.com)

### What was actually tested?

This was not a broad test of everyday medicine. The researchers used 48 diagnostically difficult critical illness cases pulled from the literature and recruited 32 critical care residents from tertiary teaching hospitals. Sixteen residents worked without AI help, and 16 worked with DeepSeek‑R1 assistance. Separately, the model itself was also scored on the same 48 cases. So the fair reading is that the model outperformed unassisted residents on a narrow, hard benchmark, not “AI beats all doctors at diagnosis.” (link.springer.com)

### Where do the 60% and 27% come from?

Straight from the paper. DeepSeek‑R1 reached 60% top-diagnosis accuracy, or 29 out of 48 cases. The non-AI resident group got 27%, or 13 out of 48. The AI-assisted resident group got 58%, or 28 out of 48. That last number is the one people should linger on, because it says the model’s value may be less about acting alone and more about lifting clinician performance toward the model’s own level. (link.springer.com)

### Did it only help accuracy?

No, it also helped speed. Residents with AI support had a median diagnostic time of 972 seconds, versus 1,920 seconds without AI. That is roughly half the time. It matters in the ICU, not because faster is always better, but because critically ill patients are exactly where delayed recognition can hurt most. The model’s answers also scored well for completeness, clarity, and usefulness on clinician ratings. (link.springer.com)

### So is DeepSeek‑R1 now “doctor level”?

Not really. Benchmarks in medicine are messy, and performance depends a lot on the task. In *Nature Medicine* in April 2025, DeepSeek‑R1 did well on some medical tasks (0.92 accuracy on USMLE questions and 0.57 on one set of text-based case challenges), but those are very different from real bedside diagnosis. A 2026 *JAMA Network Open* study [...] part of the workflow across models. In short, the models are strong, but early-stage clinical reasoning is still the hard part. (nature.com)

### Why isn’t this already everywhere?

Because deployment is not just a capability question. It is a workflow and liability question. The FDA’s clinical decision support framework draws a line between software that clinicians can independently review and higher-risk tools that function more like regulated devices. If a model suggests a diagnosis, a doctor follows it, and the patient is harmed, responsibility [...] and vendors. That uncertainty makes health systems cautious even when the benchmark numbers look impressive. (fda.gov)

### Why does the “co-clinician” framing matter?

Because it fits both the data and the regulation. The best result in this study was not autonomous AI. It was human-plus-AI. That lines up with how the FDA treats many lower-risk decision-support tools: the clinician is supposed to be able to review the basis for the recommendation rather than blindly [...] It is a diagnostic copilot that helps residents and physicians miss fewer things, faster. (link.springer.com)

### What’s the bottom line?

The striking part of this story is not that DeepSeek‑R1 hit 60%. It is that, on a hard ICU benchmark, unassisted residents scored far lower and AI-assisted residents nearly matched the model. That is the shape of adoption to watch. The technical question is starting to look answerable. The real fight now is over trust, oversight, and who owns the mistake when the machine is wrong. (link.springer.com)
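
For readers who want to check the arithmetic, the percentages and the "about twice as fast" claim can be recomputed from the counts and median times quoted above. This is a minimal sketch, not anything from the paper: the variable names and the `pct` helper are illustrative, and only the inputs (29, 13, and 28 correct out of 48; 972 and 1,920 seconds) come from the article.

```python
# Counts and times as quoted in the article; all names here are illustrative.
TOTAL_CASES = 48
correct = {
    "DeepSeek-R1 alone": 29,
    "Residents without AI": 13,
    "Residents with AI": 28,
}


def pct(hits: int, total: int) -> int:
    """Return accuracy as a whole-number percentage."""
    return round(100 * hits / total)


accuracy = {group: pct(hits, TOTAL_CASES) for group, hits in correct.items()}

# Median diagnostic time in seconds for each resident group.
time_with_ai = 972
time_without_ai = 1920
time_ratio = time_with_ai / time_without_ai

for group, value in accuracy.items():
    print(f"{group}: {value}%")
print(f"Time with AI as a share of time without: {time_ratio:.2f}")
```

Run as written, this prints 60%, 27%, and 58% for the three groups and a time ratio of about 0.51, which matches the rounded figures in the text.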
