Expose LLM scientific epistemic gaps
- JAMA Network Open published a 2026 benchmark showing 21 frontier large language models still falter on clinical reasoning, with differential diagnosis emerging as the weakest step across 29 standardized patient vignettes.
- The study logged 16,254 responses and found PrIME-LLM scores from 0.64 for Gemini 1.5 Flash to 0.78 for Grok 4, despite stronger results on final diagnosis and management.
- New April 2026 audits also found jailbreak weaknesses and mixed multimodal hospital performance, widening scrutiny of LLM reliability before clinical deployment. (arxiv.org)
Doctors do not start with one answer; they start with a shortlist. A JAMA Network Open study published in April 2026 found 21 frontier large language models were weakest at that early step, differential diagnosis. (jamanetwork.com)
The researchers tested models including GPT-5, Claude 4.5 Opus, Gemini 3.0 Flash and Pro, and Grok 4 on 29 MSD Manual clinical vignettes updated in January 2025. The run produced 16,254 scored responses across five parts of the clinical workflow. (jamanetwork.com)
Their headline metric, called PrIME-LLM, ranged from 0.64 for Gemini 1.5 Flash to 0.78 for Grok 4. Final diagnosis and management scored better than differential diagnosis, which lagged across the model set. (jamanetwork.com)
Differential diagnosis is the part where a clinician keeps several plausible causes alive before ordering tests. The paper said current models “cannot yet be relied on for unsupervised patient-facing clinical decision-making” even as reasoning-tuned systems outperformed nonreasoning ones. (jamanetwork.com)
A second April 2026 paper pushed on a different weakness: whether model internals can be steered into unsafe behavior. In “Breaking Bad,” researchers audited eight open-source models with interpretability tools that nudge hidden activations rather than just rewriting prompts. (arxiv.org)
That audit found Llama-3.3-70B-4bt produced jailbroken responses on as many as 91% of harmful queries under one steering method and 83% under another. GPT-oss-120B was reported as robust to both approaches in the same test. (arxiv.org)
A third April 2026 paper looked at multimodal hospital diagnosis, where models read a bundle of inputs such as scans, lab results, notes, and vital signs. The VALID study used 539 inpatient cases from a tertiary public hospital in South Africa and more than 10,000 evaluations across 10 frontier models. (arxiv.org)
That team reported tightly clustered model performance, with less than 15% variation despite wide cost differences, and said adding radiology reports improved results by 6%. GPT-5.1 ranked first, followed by Gemini models, while output rates ranged from 65% to 100% because some systems could not handle every input package. (arxiv.org)
The multimodal paper also found those models outperformed routine ward diagnoses on average diagnostic and safety scores in its dataset. But that result sits alongside the JAMA benchmark’s warning that early-stage reasoning remains unreliable, especially when the task is to keep multiple possibilities in play. (arxiv.org) (jamanetwork.com)
Not every paper in this cluster is about medicine directly. A separate benchmarking study on training efficiency tested AdaptiveCycle and OneCycle learning-rate schedules across 11 deep learning models and UCR and UCE time-series datasets, reporting that AdaptiveCycle delivered the best overall performance. (bugiotti.it)
Taken together, the new papers describe three different failure points: weak early clinical reasoning, model internals that can still be steered toward unsafe outputs, and uneven real-world multimodal deployment constraints. The common recommendation is tighter auditing before these systems are trusted in high-stakes care. (jamanetwork.com) (arxiv.org 1) (arxiv.org 2)