Small LLMs hit verbal confidence saturation

- A new April 2026 preprint found seven open-weight instruction-tuned language models, from 3 billion to 9 billion parameters, failed a verbal-confidence validity screen.
- On 524 TriviaQA questions across 8,384 deterministic trials, every instruct model was marked invalid, with a mean confidence ceiling rate of 91.7%.
- A March review of 246 radiology LLM papers found most omitted key reproducibility details, tightening calls for audit standards. (springer.com)

Large language models can answer a question and then state how sure they are. A new April 2026 preprint says many small open models are bad at that second part. (arxiv.org)

Jon-Paul Cacioli tested seven instruction-tuned open-weight models in the 3 billion to 9 billion parameter range across four model families. He ran 524 TriviaQA items under numeric and categorical confidence prompts, producing 8,384 deterministic trials on consumer hardware. (arxiv.org)

The paper’s main result was blunt: all seven instruct models were classified invalid on numeric confidence. Their mean ceiling rate was 91.7%, meaning the models clustered at the top end instead of spreading confidence in a way that separated right answers from wrong ones. (arxiv.org)

In plain terms, “confidence saturation” means a model keeps sounding sure even when the answer quality changes. Cacioli writes that minimal verbal elicitation failed to preserve internal uncertainty signals at the output interface in this model-size regime. (arxiv.org)

The study also found that switching from a 0-to-100 confidence scale to 10 categorical bins did not fix the problem. In six of seven models, that change pushed answer accuracy below 5%. (arxiv.org)

That matters because verbal confidence is often used as a cheap proxy for whether a model “knows” it might be wrong. The paper argues those signals should be psychometrically screened before anyone uses them downstream. (arxiv.org)

A separate April 2026 preprint from Johns Hopkins University makes a similar point in psychiatry, where language models were tested on hospitalization risk scoring. In 50 synthetic patient profiles, adding medically insignificant details increased both predicted risk and output variability across all four audited models and all prompt styles. (arxiv.org)

Radiology has a parallel problem: researchers are publishing fast, but often without enough detail to reproduce what they did. A March 16, 2026 systematic review in *Insights into Imaging* examined 246 radiology LLM studies and found only 27.6% reported model version, 41.1% shared full prompts, and 16.7% reported temperature settings. (pmc.ncbi.nlm.nih.gov) (springer.com)

Radiology groups have already been warning that large language models are not ready to replace specialist judgment. An American Journal of Roentgenology expert panel wrote in 2024 that privacy, transparency, and accuracy still limit clinical readiness, and that radiologists remain responsible for report content. (ajronline.org)

That leaves a narrower takeaway than the hype around “self-aware” models suggests. Small models may still answer usefully, but when they say how sure they are, the number can flatten into style instead of measurement. (arxiv.org)
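
For readers who want to see what a "ceiling rate" looks like in practice, here is a minimal Python sketch. It is not the paper's code. The function names, the cutoff of 95 on a 0-to-100 scale, and the simple "confidence gap" check are all assumptions chosen for illustration, and the sample numbers are made up. The idea is only to show how a pile of near-100 confidence scores can fail to separate right answers from wrong ones.

```python
from statistics import mean


def ceiling_rate(confidences, ceiling=95):
    """Share of trials whose stated confidence is at or above `ceiling`.

    The cutoff of 95 is an illustrative assumption, not the paper's definition.
    """
    return sum(c >= ceiling for c in confidences) / len(confidences)


def confidence_gap(confidences, correct):
    """Mean stated confidence on correct answers minus mean on wrong answers.

    A gap near zero suggests the stated number is not tracking answer quality.
    Returns None if either group is empty. This is a hypothetical check, not
    the psychometric screen used in the preprint.
    """
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    if not right or not wrong:
        return None
    return mean(right) - mean(wrong)


# Made-up trials: stated confidence (0-100) and whether the answer was right.
stated = [100, 100, 95, 100, 100, 90, 100, 100]
was_correct = [True, False, True, False, True, False, False, True]

print(f"ceiling rate: {ceiling_rate(stated):.1%}")
print(f"confidence gap: {confidence_gap(stated, was_correct):.2f} points")
```

In this toy data, half the answers are wrong, yet nearly every confidence score sits at the top of the scale and the gap between right and wrong answers is barely more than a point. That is the pattern the article describes as a number that has flattened into style.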
