NEJM trial finds 14-point diagnostic drop

- Researchers in Pakistan reported a randomized trial showing that wrong ChatGPT-4o advice pulled physicians off course, cutting diagnostic reasoning scores despite AI training.
- Forty-four doctors got 20 hours of AI-literacy training, yet flawed suggestions still dropped reasoning scores from 84.9% to 73.3% overall.
- That matters because earlier trials hinted LLM help could boost doctors in some settings, but this one shows bad advice can reverse that.

Medical AI is supposed to help doctors think better and faster. But this trial lands on the harder problem: what happens when the model sounds helpful and is wrong. In a randomized study of 44 physicians in Pakistan, exposure to flawed ChatGPT-4o recommendations cut diagnostic reasoning scores by 14 percentage points versus error-free recommendations. Even more striking, every doctor in the study had already completed 20 hours of AI-literacy training.

### What exactly did the trial test?

This was not a general “is AI good at medicine?” study. The researchers asked a narrower question: if physicians can choose whether to consult an LLM, do they still get pulled toward wrong answers when the model gives bad advice? Doctors were randomized 1:1 to solve 6 clinical vignettes in 75 minutes. One group saw normal ChatGPT-4o recommendations. The other saw recommendations with deliberate errors inserted into 3 of the 6 cases. (medrxiv.org)

### Who were the doctors?

They were 44 physicians recruited from multiple medical institutions in Pakistan, all registered with the Pakistan Medical and Dental Council and all holding MBBS degrees. The important detail is that these were not AI-naive clinicians. Each participant had completed a 20-hour training program covering LLM capabilities, prompt engineering, and how to critically evaluate model output. (medrxiv.org)

### How big was the drop?

Pretty big. Mean diagnostic reasoning accuracy was 84.9% in the error-free group and 73.3% in the flawed-advice group. After adjustment, that came out to a 14.0 percentage-point decline, with a reported P value below .0001. The secondary outcome moved the same way: top-choice diagnosis accuracy fell from 90.5% to 76.1%, an adjusted drop of 18.3 percentage points. (medrxiv.org)

### What does “automation bias” mean here?

Basically, it means people start outsourcing part of their judgment to the machine. Not because they are lazy or unskilled, but because a confident recommendation changes the path of thought. In medicine, that is dangerous for an obvious reason: a bad suggestion does not just sit there. It can anchor the doctor, narrow the differential, and make contradictory clues easier to miss. The trial was designed to measure exactly that effect.

### Why is this result uncomfortable?

Because AI training alone did not solve the problem. The study population had already been taught how to use and question LLMs, but the bias still showed up. That pushes against the comforting idea that a short safety course will make clinical use robust. Turns out the harder part is not knowing that models err. It is resisting a polished answer in the moment, under time pressure, while trying to finish the case. (clinicaltrials.gov)

### Does this mean LLMs are useless in diagnosis?

No, and that is the catch. Another randomized study from the same research group, involving 58 physicians in Pakistan, found that LLM access improved diagnostic reasoning by 27.5 percentage points versus conventional resources alone. A separate 2024 JAMA Network Open trial found no significant improvement for physicians using an LLM compared with conventional resources, even though the LLM by itself outperformed both physician groups. (medrxiv.org) So the picture is mixed: models can help, but help is fragile and heavily dependent on setup and output quality.

### So what should hospitals take from this?

The obvious lesson is not “ban AI.” It is “don’t deploy it as if a plausible answer were a safe answer.” If an LLM enters clinical workflow, it probably needs guardrails that force verification, surface uncertainty, and make disagreement easier rather than harder. Otherwise the interface becomes a very persuasive nudge generator.

### Bottom line

This trial matters because it turns a vague worry into a measured effect. (medrxiv.org) Wrong AI advice did not just fail to help; it made trained physicians worse. In medicine, that is the threshold where “assistant” stops sounding reassuring and starts sounding like a systems-design problem. (medrxiv.org) (clinicaltrials.gov)
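
A note on the numbers quoted under “How big was the drop?”: the two group means sit closer together than the adjusted estimates suggest. The short Python sketch below uses only the means quoted in this article to show the unadjusted gaps and the corresponding relative drops. The function name is hypothetical and exists only for illustration. The adjusted figures (14.0 and 18.3 percentage points) come from the trial's own statistical model and cannot be reproduced from the means alone.

```python
# Illustrative arithmetic only. Inputs are the group means quoted in the article;
# the helper name is hypothetical and not taken from the trial.

def point_and_relative_drop(control_pct: float, exposed_pct: float) -> tuple[float, float]:
    """Return (percentage-point gap, relative drop in percent), each rounded to 1 decimal."""
    point_gap = control_pct - exposed_pct
    relative_drop = point_gap / control_pct * 100
    return round(point_gap, 1), round(relative_drop, 1)

# Primary outcome: diagnostic reasoning accuracy (error-free vs. flawed advice).
primary = point_and_relative_drop(84.9, 73.3)

# Secondary outcome: top-choice diagnosis accuracy.
secondary = point_and_relative_drop(90.5, 76.1)

print(primary)    # (11.6, 13.7)
print(secondary)  # (14.4, 15.9)
```

The unadjusted gap on the primary outcome is 11.6 percentage points, or roughly 13.7% in relative terms. Percentage points and percent measure different things, so it is worth checking which one a headline figure refers to.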
