Multimodal models bluffing
A study testing 22 multimodal models found that when visual input was missing, most models guessed or fabricated answers instead of asking for clarifying information. A simple reinforcement-learning approach in the tests suggested models can be trained to ask for missing context rather than hallucinate. (the-decoder.com)
Multimodal artificial intelligence models are built to read text and images together, but a 2025 study found they often bluff when the visual part is missing or flawed instead of saying they need more information. (arxiv.org) The study, posted on arXiv on June 1, 2025 and revised on August 25, 2025, tested six multimodal large language models, including o3 and GPT-4o, on prompts with missing objects, contradictory facts, ambiguous references, and infeasible actions. The authors wrote that these systems “frequently fail to surface hidden issues” even when the needed perception and reasoning skills appear to be present. (arxiv.org) A multimodal model works like a chatbot with a camera attached: it turns pixels and words into tokens, then predicts the next token as an answer. When the image does not actually support the request, the safest move is to ask a follow-up question or point out the mismatch, not to guess. (arxiv.org) That gap has become more important as companies push image-capable models into search, shopping, software agents, and robotics, where a wrong answer can trigger a bad action. A 2025 survey on multimodal hallucination described the problem as outputs that look plausible but are not grounded in the visual input. (arxiv.org) The 2025 underspecification paper found that explicit prompting changed behavior: asking models to act cautiously or requiring a clarifying question sharply improved results. The authors said the issue looked less like a total lack of capability and more like a bias toward complying with the user’s request. (arxiv.org) That pattern matches other recent work showing that vision models often struggle not just with seeing, but with using what they saw during reasoning. A January 2026 paper from Dartmouth College and the University of Central Florida reported that converting images into text descriptions improved performance on visual reasoning tasks by 26.7 percent for Claude 3.5 and 23.6 percent for Claude 3.7. (arxiv.org) Researchers have also been testing reinforcement learning, a reward-based training method, to make multimodal systems pause and ask for help when the situation is unclear. An April 2025 paper from Georgia Institute of Technology and Meta FAIR said its method trained multimodal models in ambiguous environments to ask clarification questions and beat zero-shot baselines on a task called Ask-to-Act. (arxiv.org) The broader research record points in the same direction: these models often have the pieces needed to notice a problem, but they do not reliably choose that behavior on their own. The next round of model training is likely to focus less on making answers sound confident and more on getting systems to admit when they cannot really see. (arxiv.org)