Multimodal models prefer guessing

A benchmark of 22 multimodal models found most prefer to guess when visual information is missing instead of asking for clarification, and a reinforcement‑learning tweak helped models ask for help more often. The report highlights persistent uncertainty‑handling issues in multimodal systems and presents a potential training adjustment that nudges models to request missing context (the-decoder.com).

Multimodal artificial intelligence systems can read text and inspect images, but a March 19, 2026 benchmark found most still guess instead of asking for a better view. (arxiv.org) The benchmark, called ProactiveBench, tested 22 multimodal large language models on tasks where the image alone was not enough to answer correctly. The authors built it from seven existing datasets covering occluded objects, blurred images, coarse sketches, changing scenes, and other cases that required a user to provide missing visual context. (arxiv.org) In plain terms, the test asked whether a model would do what a person does when vision fails: ask someone to move an object, show another angle, or send a clearer picture. Instead, the paper says current systems usually stayed “reactive,” meaning they answered anyway, hallucinated, or refused rather than requesting the extra information they needed. (arxiv.org) The paper arrives as companies push image-and-text assistants into search, office software, phones, and robotics, where one bad visual guess can carry into the next step. The authors write that proactive behavior did not track model size, and that prompting models to be more proactive produced only small gains. (arxiv.org) The dataset itself is large enough to stress-test that behavior at scale. ProactiveBench includes more than 108,000 images arranged into 18,000 samples, and the researchers filtered out cases a model could already solve on the first try so the benchmark measured uncertainty handling, not just raw recognition. (developmentstoday.com) The authors also tested whether training could shift that habit. Their paper says a simple reinforcement-learning fine-tuning setup improved proactive behavior and generalized to scenarios the model had not seen during training. (arxiv.org) That training idea differs from a separate September 30, 2025 paper from the University of Amsterdam and the University of Edinburgh, which used reinforcement learning to make vision systems include missing details earlier so downstream reasoners would need fewer follow-up questions. That paper reported a 4.4-point average accuracy gain across seven visual math benchmarks and up to a 39 percent drop in clarification requests when those requests were allowed. (arxiv.org) Together, the papers point to the same weakness from opposite directions: multimodal systems still struggle to recognize when they cannot see enough. The new benchmark’s closing argument is narrower and more practical — if these models are going to work with people, they need to know when to stop guessing and ask for help. (arxiv.org)

Get your own daily briefing

Scout delivers personalized news, insights, and conversations tailored to your role and industry.

Download on the App Store

Shared from Scout - Be the smartest in the room.