AI can 'hallucinate' medical images

Frontier AI models have been shown to produce detailed radiology-style interpretations even when they never saw the actual X-ray image, effectively inventing visual findings rather than extracting them. Commentators stress that such models should be used as decision support, not as substitutes for human interpretation, and that governance must control how their outputs are weighted in diagnosis. These failures show why careful validation and explicit boundaries are needed before AI tools touch diagnostic decisions. (futurism.com) (cnbctv18.com)

# AI Can ‘Hallucinate’ Medical Images

A radiology report is supposed to begin with a picture. A chest X-ray goes in, and a doctor describes what is actually visible on the scan: a collapsed lung, a patch of pneumonia, a broken rib, or nothing abnormal at all.

The new problem is stranger than a normal artificial intelligence mistake. Some frontier models are now producing polished, radiology-style interpretations even when they were never shown the X-ray in the first place. Researchers call this “mirage reasoning,” because the system writes as if it saw an image that does not exist in its input. (arxiv.org) (futurism.com)

That is different from a simple wrong answer. A normal error is like a student misreading a diagram on a test. Mirage reasoning is like a student confidently describing details from a diagram that was never printed on the page. (arxiv.org)

The risk comes from how medical image interpretation works in real life. Radiologists do not just guess the most statistically likely disease from a sentence. They inspect shadows, edges, densities, symmetry, and small changes that can alter a diagnosis. If an artificial intelligence system invents those visual details, it can sound expert while skipping the one thing it was asked to do: look. (nih.gov)

This is not the first warning sign. In July 2024, researchers at the National Institutes of Health reported that GPT-4V answered medical image quiz questions with high accuracy, but physician graders found that it often made mistakes when describing the image and explaining its reasoning, including cases where it still landed on the right final diagnosis. (nih.gov)

That detail matters because medicine is not graded like a multiple-choice exam. In a clinic, a correct answer reached for the wrong reason can still be dangerous, because the explanation may guide follow-up tests, treatment choices, or the urgency of care.
A system that sounds persuasive can push people to trust a diagnosis more than the evidence deserves. (nih.gov) (cnbctv18.com)

The new arXiv paper goes further than the earlier National Institutes of Health result. The authors reported that frontier models generated “detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided,” which means the systems did not merely misread a scan; they acted as though a scan had been supplied when none had been given. (arxiv.org) (futurism.com)

The authors argue that these systems can exploit clues hidden in the wording of a question, in common disease patterns, or in familiar benchmark formats. In plain English, the model may be solving the prompt the way a test-prep machine would solve an exam, by pattern-matching on context, instead of by actually interpreting the image. (arxiv.org) (futurism.com)

That helps explain why benchmark claims around medical artificial intelligence can be misleading. A model can look impressive on public datasets if the questions, answer choices, or disease frequencies leak enough hints for a strong guess. But a real hospital case is messy, incomplete, and full of unusual edge cases that do not behave like internet trivia. (arxiv.org 1) (arxiv.org 2)

One recent benchmark illustrates the gap. In the paper “Radiology’s Last Exam,” posted to arXiv in September 2025, board-certified radiologists reached 83 percent diagnostic accuracy on 50 expert-level imaging cases, while the best-performing artificial intelligence model in that study reached 30 percent. The authors concluded that frontier models still fall far short of radiologists on difficult cases and warned against unsupervised clinical use. (arxiv.org)

None of this means artificial intelligence has no place in radiology.
Hospitals already use machine learning tools for narrower jobs such as worklist prioritization, image quality checks, automated measurements, and decision support. Those uses are different from handing a general-purpose chatbot the role of a diagnosing clinician. (clinicalimaging.org) (link.springer.com)

That distinction is exactly what many clinicians and commentators are now emphasizing. The safer view is that artificial intelligence can be a second set of eyes, a drafting assistant, or a triage tool, but not the final reader of a scan and not the decision-maker for a patient. CNBC TV18 summarized the current consensus bluntly on April 7, 2026: chatbots may be a complementary resource, but they should not be the decision maker on someone’s health. (cnbctv18.com)

The governance question is becoming as important as the software itself. A recent Clinical Imaging commentary argues that clinical artificial intelligence should not be treated like ordinary software, because these tools can shape which scans are reviewed first, what findings get highlighted, and how downstream decisions are made. In practice, that means hospitals need rules for where the model is allowed to speak, how much weight its output gets, who can override it, and how failures are audited. (clinicalimaging.org)

There is also a patient-side risk. Consumer-facing systems make it easy for someone with a cough, a fever, or a scanned report to ask for instant medical advice. If the model writes in the style of a radiologist, many users will assume it is doing radiology, even when it is only predicting plausible language from text patterns. That gap between style and substance is where false confidence grows. (arxiv.org) (cnbctv18.com)

The deeper lesson is not that artificial intelligence is useless in medicine. It is that medicine punishes hidden shortcuts. A system that can invent an unseen X-ray finding is not ready to be trusted just because it sounds calm, technical, and precise.
Before these tools touch diagnosis, they need validation on private and difficult cases, clear limits on where they can be used, and human experts who remain fully responsible for the final call. (arxiv.org) (nih.gov)
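
One way to picture that kind of validation is an image-withheld control: score the same questions once with the scan attached and once without it, then compare both numbers to random guessing. If accuracy stays high when the image is missing, the questions are probably leaking their answers through the text. The Python sketch below is purely illustrative and is not taken from the cited papers; `Case`, `blind_control`, the toy `text_only_model`, and the four sample questions are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence


@dataclass(frozen=True)
class Case:
    question: str            # text prompt shown to the model
    choices: Sequence[str]   # multiple-choice answer options
    answer: str              # ground-truth choice
    image: Optional[bytes]   # raw image bytes (placeholder data here)


# Hypothetical model interface: (question, choices, image or None) -> choice.
Model = Callable[[str, Sequence[str], Optional[bytes]], str]


def accuracy(model: Model, cases: Sequence[Case], with_image: bool) -> float:
    """Fraction of cases answered correctly, with or without the image."""
    if not cases:
        raise ValueError("need at least one case")
    correct = 0
    for case in cases:
        image = case.image if with_image else None  # withhold the scan
        if model(case.question, case.choices, image) == case.answer:
            correct += 1
    return correct / len(cases)


def blind_control(model: Model, cases: Sequence[Case]) -> Dict[str, float]:
    """Compare accuracy with the image, without it, and at random chance."""
    chance = sum(1 / len(c.choices) for c in cases) / len(cases)
    return {
        "with_image": accuracy(model, cases, with_image=True),
        "image_withheld": accuracy(model, cases, with_image=False),
        "chance": chance,
    }


def text_only_model(question: str, choices: Sequence[str],
                    image: Optional[bytes]) -> str:
    """Toy stand-in that never looks at the image.

    It returns the first choice whose text appears in the question,
    falling back to the first choice in the list.
    """
    lowered = question.lower()
    for choice in choices:
        if choice.lower() in lowered:
            return choice
    return choices[0]


CHOICES = ("normal", "pneumonia", "pneumothorax", "rib fracture")
FAKE_SCAN = b"\x00"  # placeholder bytes, not a real X-ray

# Made-up sample questions; two of them name the answer in the prompt.
CASES = [
    Case("Fever and productive cough, rule out pneumonia. "
         "What does the X-ray show?", CHOICES, "pneumonia", FAKE_SCAN),
    Case("Routine pre-employment screening. What does the X-ray show?",
         CHOICES, "normal", FAKE_SCAN),
    Case("Fall from a ladder, suspected rib fracture. "
         "What does the X-ray show?", CHOICES, "rib fracture", FAKE_SCAN),
    Case("Sudden chest pain in a tall young man. What does the X-ray show?",
         CHOICES, "pneumothorax", FAKE_SCAN),
]

if __name__ == "__main__":
    print(blind_control(text_only_model, CASES))
```

On this toy data the model scores 75 percent with the image and 75 percent without it, against a 25 percent chance rate. Identical scores in both conditions suggest the picture is doing no work, which is the kind of text-only shortcut described above. A real evaluation would need far more cases and a proper statistical comparison.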
