Multimodal LLM safety issues

- Benchmarking work finds multimodal LLMs underperform on high-stakes tasks like inpatient diagnosis and audio safety.
- Some benchmarks show failure rates above 80% on clinical inpatient diagnosis tasks.
- The papers also highlight audio-model degradation and jailbreak divergences, stressing the need for realistic, latency-aware testing.

Multimodal artificial intelligence systems can see, hear, and read at once, but new benchmarks show they still miss basic safety checks in medicine and audio. (arxiv.org, arxiv.org)

These systems combine text with images, scans, or speech so one model can answer questions about an X-ray, a lab panel, or a spoken prompt. That wider input range also opens new failure modes that text-only tests do not catch. (arxiv.org, arxiv.org, omni-safetybench.github.io)

One April 21, 2026 paper tested 10 frontier multimodal large language models on 539 inpatient cases from a tertiary public hospital in South Africa, using clinical notes, lab results, vital signs, radiology reports, and imaging. The authors said expert panels adjudicated 300 of those cases and used the results for more than 10,000 model evaluations. (arxiv.org)

That study reported mean diagnostic performance below 15% across all 10 models, with results tightly clustered despite a 50-fold cost range between systems. Routine ward diagnoses were used as a real-world clinical comparator rather than a synthetic baseline. (arxiv.org)

Audio models are showing a parallel problem. A benchmark paper on audio jailbreaks said existing work had focused mostly on text and images, then introduced a dataset and benchmark for large audio-language models that tested whether spoken or edited audio could bypass safety rules. (arxiv.org)

A separate NeurIPS 2025 workshop paper tested 1,900 adversarial prompts on seven frontier vision-language and audio-language models. The prompts covered three categories: harmful content; chemical, biological, radiological, and nuclear material; and child sexual exploitation material. The authors said models with 0% text-only attack success could still exceed 75% attack success after simple image or audio transformations, and one Llama-4 variant reached 89%. (openreview.net)

Another benchmark, OmniSafetyBench, expands that problem to systems that handle image, audio, and text together. Its authors built 23,328 evaluation samples across 24 modality variations and reported that only 3 of 10 tested models scored above 0.6 on both overall safety and cross-modal consistency. (omni-safetybench.github.io)

The common finding across these papers is that safety tuned for plain text does not reliably carry over when the same request is hidden in a scan, a picture, or a waveform. The benchmarks also measure consistency across modalities, because a model that refuses a typed request but answers a spoken version creates a new attack path. (openreview.net, omni-safetybench.github.io, arxiv.org)

The medical paper also points to a deployment problem: hospital use depends on accuracy, safety, and cost at the same time, not on benchmark scores in isolation. The authors evaluated real inpatient records from a public hospital, which makes the results harder to dismiss as lab artifacts. (arxiv.org)

More multimodal models are being shipped into products that answer with voice, inspect documents, and interpret images, and the newest safety papers are testing them under messier conditions closer to real use. The gap they describe is not whether models can process more kinds of input, but whether they stay reliable when those inputs matter most. (arxiv.org, omni-safetybench.github.io, openreview.net)
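
For readers who want to see how the two headline metrics relate, here is a minimal Python sketch of per-modality attack success rate and a simple cross-modal consistency score. It is an illustration only: the function names, the tuple layout, and the toy data are assumptions made for this example, and the definitions are simplified stand-ins rather than the formulas used by OmniSafetyBench or the other papers.

```python
from collections import defaultdict

# Each record is (prompt_id, modality, complied), where complied=True means the
# model answered a harmful request instead of refusing. This layout is a
# hypothetical format chosen for the example.

def attack_success_rate(results):
    """Return the fraction of attempts that succeeded, per modality."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for _, modality, complied in results:
        totals[modality] += 1
        hits[modality] += int(complied)
    return {m: hits[m] / totals[m] for m in totals}

def cross_modal_consistency(results):
    """Return the fraction of prompts with the same outcome in every modality.

    Simplified, assumed definition: a prompt counts as consistent only if the
    model either refused it in all modalities or complied in all of them.
    """
    outcomes_by_prompt = defaultdict(set)
    for prompt_id, _, complied in results:
        outcomes_by_prompt[prompt_id].add(bool(complied))
    if not outcomes_by_prompt:
        return 0.0
    consistent = sum(1 for o in outcomes_by_prompt.values() if len(o) == 1)
    return consistent / len(outcomes_by_prompt)

# Toy data (invented): the model refuses every typed prompt but answers
# three of the four spoken versions.
results = [
    ("p1", "text", False), ("p1", "audio", True),
    ("p2", "text", False), ("p2", "audio", False),
    ("p3", "text", False), ("p3", "audio", True),
    ("p4", "text", False), ("p4", "audio", True),
]

print(attack_success_rate(results))
print(cross_modal_consistency(results))
```

In this toy case the text-only attack success rate is 0% while the audio rate is 75%, and only one prompt in four gets the same treatment in both modalities. That is the pattern the papers describe: a clean text-only score can coexist with a wide gap once the same request arrives through another channel.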

