DeepSeek‑R1 hits 60% top‑1

- *Critical Care* published a June 6, 2025 study showing DeepSeek‑R1 beat unaided critical care residents on 48 difficult ICU diagnosis cases.
- The headline numbers were 60% top‑1 accuracy for DeepSeek‑R1, 27% for residents alone, and 58% for residents using the model.
- AI assistance also cut residents’ median diagnosis time nearly in half, but the model was tested only as a support tool, not as a replacement for clinicians.
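
As a quick sanity check, the headline percentages can be reproduced from the case counts and median times quoted later in this article (29, 13, and 28 correct out of 48; 972 versus 1,920 seconds). The sketch below is only that arithmetic. The variable and function names are illustrative assumptions, not anything taken from the paper.

```python
# Figures as quoted in this article; all names here are illustrative assumptions.
TOTAL_CASES = 48
correct = {
    "DeepSeek-R1 alone": 29,
    "Residents without AI": 13,
    "Residents with AI": 28,
}


def pct(numerator: int, denominator: int) -> int:
    """Return a proportion as a whole-number percentage."""
    return round(100 * numerator / denominator)


# Top-1 accuracy per group, rounded the way the headline numbers are.
accuracy = {group: pct(n, TOTAL_CASES) for group, n in correct.items()}

# Median diagnostic time in seconds, without and with AI assistance.
MEDIAN_SECONDS_WITHOUT_AI = 1920
MEDIAN_SECONDS_WITH_AI = 972
time_reduction = 1 - MEDIAN_SECONDS_WITH_AI / MEDIAN_SECONDS_WITHOUT_AI

for group, value in accuracy.items():
    print(f"{group}: {value}% top-1")
print(f"Median time reduction with AI: {time_reduction:.1%}")
```

The counts round to 60%, 27%, and 58%, and the time reduction comes out just under one half, which matches the "nearly in half" wording.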

A medical diagnosis model looks impressive when it gets trivia right. The harder test is whether it can help with messy ICU cases where the clues conflict and the clock is running. That is the setup here. A June 6, 2025 paper in *Critical Care* says DeepSeek‑R1 did better than unaided critical care residents on a small set of difficult cases and, more importantly, that residents using it got both faster and more accurate.

### What actually got tested?

This was not a giant hospital rollout. The researchers pulled 48 challenging critical illness cases from the literature and recruited 32 critical care residents from tertiary teaching hospitals, splitting them into AI-assisted and non-AI-assisted groups. Each resident handled about three cases, naming the most likely diagnosis and building a plausible list of alternatives.

### Why is “top‑1” the number people care about?

Top‑1 means the first diagnosis named was the correct one. That is the harsh version of the test. On that measure, DeepSeek‑R1 hit 60% (29 of 48 cases). Residents working without AI got 27% (13 of 48). Residents using AI got 58% (28 of 48). So the striking part is that assisted residents more than doubled the unassisted score and came within one case of the model’s standalone performance, though neither group exceeded it.

### Did it help doctors work faster?

Yes, a lot, at least in this setup. Median diagnostic time fell from 1,920 seconds without AI assistance to 972 seconds with it, or from 32 minutes to about 16. That is close to cutting the time in half. In plain English, the model did not just produce a stronger first guess. It seems to have shortened the search process for residents who were trying to reason through very hard cases.

### Was the model’s output any good beyond the first guess?

The paper says yes. Reviewers gave DeepSeek‑R1 median scores of 4 out of 5 for completeness and 5 out of 5 for clarity and usefulness. Its differential-diagnosis quality score was also high, with a median of 5 out of 5. That matters because in medicine the first answer is not the whole job: even when the top guess is wrong, a well-built differential can suggest a sensible workup.

### So is this “AI beats doctors”?

Not really. The stronger reading is narrower and more interesting: AI assistance lifted residents toward the model’s level. The assisted group’s top‑1 accuracy, 58%, was far above the unassisted group’s 27%. That makes this look more like a copiloting result than an autonomy result. The win is human-plus-model, not model-instead-of-human.

### What are the catches?

The sample was small. The cases came from published literature, not a live ICU. The residents were trainees, not senior attendings. And “difficult cases” can be selected in ways that flatter
