Models tend to guess, not ask

Researchers tested 22 models and found that, when visual context was missing, most guessed at answers instead of asking for clarification. The Decoder reports that almost none of the models asked follow-up questions, highlighting a common failure mode in current AI systems (The Decoder).

Multimodal artificial intelligence is supposed to answer questions about images, but a new Stanford-led study found that many systems will invent details when no image is provided instead of asking to see it. (arxiv.org)

The researchers built a test called Phantom-0 and prompted 22 frontier models across 20 categories with image-dependent questions while withholding the image. Across models, "mirage" behavior (answering as if the image existed) appeared more than 60% of the time, according to a Tech Xplore report on the paper published April 12, 2026. (techxplore.com)

The paper, posted to arXiv on March 23 and revised on April 2, says models generated detailed descriptions and reasoning traces for images "never provided." The authors include researchers from Stanford University's electrical engineering, biomedical data science, computer science, psychiatry, and medicine departments. (arxiv.org)

A multimodal model works like a chatbot with a camera attached: it should combine words with what it actually sees. The study argues that many current benchmarks still let models score well from text clues alone, even when the visual evidence is missing. (arxiv.org)

That matters most in medical use, where the paper says one model reached the top rank on a standard chest X-ray question-answering benchmark without any image access. The authors say that result shows some benchmark questions can be answered from patterns in the wording rather than from the scan itself. (arxiv.org)

The researchers also found a difference between silent guessing and admitted guessing. When models were explicitly told to guess without image access, performance dropped, which the paper says suggests the systems become more cautious when the missing context is stated outright. (arxiv.org)

To address the benchmark problem, the team introduced a filtering method called B-Clean that removes questions compromised by text-only shortcuts. The paper says the method was applied to MMMU-Pro, MedXpertQA-MM, and MicroVQA, yielding cleaned versions so that evaluations better reflect actual visual understanding. (arxiv.org)

The study is a preprint, not a peer-reviewed journal paper, and arXiv notes that papers on the site are not peer reviewed. Even so, the results add to a broader line of work showing that language-heavy systems often answer first and clarify later, if at all. (arxiv.org; arxiv.org)

The opening failure is simple enough to describe without jargon: if a system cannot see, the safe move is to ask for the image. In this test, many of today's best-known models acted as if they had already seen it. (techxplore.com; arxiv.org)
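
The idea behind filtering out questions with text-only shortcuts can be illustrated with a short sketch. This is not the paper's B-Clean implementation, whose details are not given here; it is a minimal illustration that assumes a question counts as compromised if any model answers it correctly without seeing the image. The `Question` class, the `blind_answers` field, and both function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    qid: str
    answer: str  # ground-truth answer
    # Hypothetical field: model name -> answer given WITHOUT the image.
    blind_answers: dict[str, str] = field(default_factory=dict)


def is_compromised(q: Question) -> bool:
    """Assumed rule: compromised if any model is right with no image access."""
    target = q.answer.strip().lower()
    return any(a.strip().lower() == target for a in q.blind_answers.values())


def filter_benchmark(questions: list[Question]) -> list[Question]:
    """Keep only questions that no model answered correctly while blind."""
    return [q for q in questions if not is_compromised(q)]
```

Under this assumed rule, a question that even one model gets right from the wording alone is dropped, so the remaining set leans toward items that actually require the image. A real filter would likely need a more careful criterion, for example one that accounts for lucky guesses on multiple-choice items.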
