NEJM trial finds LLM automation bias
- Ihsan Ayyub Qazi and colleagues reported on May 12 that a randomized NEJM AI clinical trial found erroneous LLM advice reduced physicians’ diagnostic accuracy.
- In 44 AI-trained physicians, mean diagnostic accuracy was 84.9% with error-free recommendations and 73.3% with flawed ones, a 14-point adjusted drop.
- The paper’s data-sharing statement says deidentified participant-level data and analytic code are available from the corresponding author with publication.
Ihsan Ayyub Qazi and colleagues reported in NEJM AI that erroneous large language model recommendations reduced diagnostic performance in a randomized clinical trial of physicians who had already completed AI-literacy training. The study, titled “Automation Bias in Large Language Model–Assisted Diagnostic Reasoning among Physicians Trained in AI Literacy,” was published by NEJM AI under DOI 10.1056/AIoa2501001, according to the journal’s data-sharing statement.

The trial enrolled 44 physicians in Pakistan and randomly assigned 22 to receive unmodified ChatGPT-4o diagnostic recommendations and 22 to receive recommendations with deliberate errors in three of six clinical vignettes, the authors wrote in a preprint version of the study. All participants held MBBS degrees, were registered with the Pakistan Medical and Dental Council, and had completed a 20-hour AI-literacy program covering LLM capabilities, prompt engineering and critical evaluation of AI output. (ai.nejm.org)

The main result was a 14.0 percentage-point adjusted decline in diagnostic reasoning accuracy among physicians exposed to flawed recommendations. Mean diagnostic accuracy was 84.9% in the control group, which received error-free recommendations, and 73.3% in the treatment group, which saw flawed outputs, with a reported P value below .0001, according to the preprint. A second measure showed a larger gap in top-choice diagnoses. (medrxiv.org)

The treatment group recorded top-choice diagnosis accuracy of 76.1% per case, compared with 90.5% in the control group, for an adjusted difference of 18.3 percentage points, the preprint said. Three blinded physicians scored responses using an expert-validated rubric that assessed differential diagnosis accuracy, supporting and opposing evidence, and recommended diagnostic steps.

The study design was built to test automation bias under voluntary use rather than mandatory AI reliance.
Physicians could choose whether to consult the offered ChatGPT-4o recommendations alongside conventional diagnostic resources, the authors said, making the trial a measure of whether doctors would adopt and propagate faulty AI suggestions even when they were not required to use them.

The registration record shows the project was sponsored by Lahore University of Management Sciences, with Qazi listed as responsible party, and was registered as NCT06963957 on ClinicalTrials.gov. (medrxiv.org) A separate American Economic Association registry entry listed the trial under the title “Trust or Verify? Automation Bias in Physician-LLM Diagnostic Reasoning,” with an initial public posting on April 30, 2025, and identified co-investigators from Lahore University of Management Sciences, King Edward Medical University, Lahore General Hospital and Children’s Hospital, Lahore.

NEJM AI has separately published work arguing that medical AI should be tested with the same kind of rigorous clinical evaluation used for other interventions. In a 2023 editorial launching the journal, Isaac Kohane wrote that AI in medicine should undergo randomized clinical trials despite the complexity of the technology.

The next step is data access and follow-up scrutiny. NEJM AI’s data-sharing statement, posted on April 9, 2026, says deidentified participant-level data and analytic code will be available from the corresponding author, ihsan.qazi@lums.edu.pk, to researchers at academic institutions for non-commercial use under a signed data-use agreement. (clinicaltrials.gov) (ai.nejm.org 1) (ai.nejm.org 2)
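
A note on the arithmetic: the adjusted differences quoted above (14.0 and 18.3 percentage points) are not the simple gaps between the reported group means. The short Python sketch below subtracts the means as quoted in this article and sets the result beside the adjusted figures. The adjustment model itself is not described here, so the code does not try to reproduce it; the dictionary layout and function names are illustrative assumptions, not anything taken from the study's analytic code.

```python
# Group means and adjusted gaps exactly as quoted in the article (percent).
# The dictionary keys are illustrative labels, not the study's variable names.
reported = {
    "diagnostic_reasoning": {"control": 84.9, "treatment": 73.3, "adjusted_gap": 14.0},
    "top_choice": {"control": 90.5, "treatment": 76.1, "adjusted_gap": 18.3},
}


def raw_gap(control: float, treatment: float) -> float:
    """Unadjusted difference in percentage points, rounded to one decimal."""
    return round(control - treatment, 1)


def summarize(figures: dict) -> dict:
    """Pair each unadjusted gap with the adjusted gap the authors reported."""
    return {
        name: {
            "raw_gap": raw_gap(f["control"], f["treatment"]),
            "adjusted_gap": f["adjusted_gap"],
        }
        for name, f in figures.items()
    }


summary = summarize(reported)
for name, row in summary.items():
    print(f"{name}: raw {row['raw_gap']} pts, adjusted {row['adjusted_gap']} pts")
```

On both measures the unadjusted gap is smaller than the adjusted one, which is why the 14-point figure should be read as a model-based estimate and not as 84.9 minus 73.3.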