Studies warn of LLM over‑reliance
Recent research from the Air Force Research Laboratory (AFRL), Wharton and Princeton finds that people who use large language models can narrow their analysis and grow overconfident, which weakens human judgment in decision tasks. The studies recommend Pentagon monitoring and stronger governance around model use in decision-support systems (x.com status).
A large language model can make a wrong answer feel finished the way a polished spreadsheet can make a bad forecast look official. A 2025 Princeton-led study found that simply adding explanations made people rely more on answers whether those answers were right or wrong. (arxiv.org) That Princeton study tested 308 people in a controlled experiment and changed one thing at a time: explanations, source links, and inconsistencies. Source links and visible inconsistencies reduced reliance on wrong answers, but explanations alone increased trust in both good and bad ones. (arxiv.org)
A second paper found the confidence problem runs inside the models too, not just in the people using them. In five large language models, researchers measured overestimates of correctness ranging from 20% to 60%. (arxiv.org) When humans used model advice on reasoning problems, their answers got more accurate, but their confidence rose even faster. The same paper says model input more than doubled the amount of overconfidence in human answers. (arxiv.org)
Wharton researchers saw a similar pattern in classrooms, where the short-term gain came with a long-term cost. In a field experiment with nearly 1,000 high school math students, access to GPT-4 raised grades during practice, but students later scored 17% worse than peers when the tool was taken away. (papers.ssrn.com) That study compared a plain chatbot with a tutor version built to slow students down and push them to reason. The guarded version still improved practice performance and reduced the learning loss that came from using the model as what the paper calls a “crutch.” (papers.ssrn.com)
This is why the military angle is getting attention. The Air Force Research Laboratory said in April 2025 that it built a secure sandbox to collect data on how large language models affect decision-making, and said role-specific training is necessary because access alone is not enough. (afrl.af.mil) The Pentagon is already trying to build tests for this class of tools before they spread deeper into planning workflows. In February 2024, the Chief Digital and Artificial Intelligence Office hired Scale AI to create a testing and evaluation framework for models that could support military planning and decision-making. (defensescoop.com)
Put together, the warning is not that large language models are useless. The warning is that a fluent assistant can narrow the set of options a person considers, raise their confidence faster than their judgment improves, and leave them weaker when the assistant disappears. (arxiv.org 1) (arxiv.org 2) (papers.ssrn.com)
The fix in all three strands of research is less “ban the tool” than “design the tool so humans keep thinking.” Show sources, surface contradictions, force reasoning steps, measure reliance in real deployments, and treat decision support systems the way aviation treats checklists: useful only if the pilot is still flying the plane. (arxiv.org) (afrl.af.mil)
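
For readers who want to see what "measure reliance" could look like in practice, here is a minimal Python sketch. It is an illustration only, not the method used in any of the studies cited above. It assumes a hypothetical log in which each trial records the model's answer, the person's final answer, the correct answer, and the person's stated confidence on a 0 to 1 scale. The names `Trial`, `overconfidence_gap` and `reliance_on_wrong_advice` are invented for this example, and the sample data is made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    """One logged decision (hypothetical schema, assumed for illustration)."""
    model_answer: str
    user_answer: str
    correct_answer: str
    user_confidence: float  # self-reported, 0.0 to 1.0


def overconfidence_gap(trials: List[Trial]) -> float:
    """Mean stated confidence minus actual accuracy.

    A positive value means people felt more sure than their results justified.
    """
    if not trials:
        raise ValueError("need at least one trial")
    accuracy = sum(t.user_answer == t.correct_answer for t in trials) / len(trials)
    mean_confidence = sum(t.user_confidence for t in trials) / len(trials)
    return mean_confidence - accuracy


def reliance_on_wrong_advice(trials: List[Trial]) -> Optional[float]:
    """Share of wrong model answers that the person adopted anyway.

    Returns None when the model was never wrong in the log.
    """
    wrong = [t for t in trials if t.model_answer != t.correct_answer]
    if not wrong:
        return None
    return sum(t.user_answer == t.model_answer for t in wrong) / len(wrong)


# Made-up sample log: the model is wrong twice, and the person follows it once.
trials = [
    Trial("A", "A", "A", 0.9),
    Trial("B", "B", "C", 0.9),  # model wrong, person followed it
    Trial("D", "D", "D", 0.8),
    Trial("E", "F", "F", 0.6),  # model wrong, person overrode it
]

print("overconfidence gap:", round(overconfidence_gap(trials), 3))
print("reliance on wrong advice:", reliance_on_wrong_advice(trials))
```

Tracking both numbers together matters because, as the research above suggests, accuracy can rise while confidence rises faster. A deployment that watches only accuracy would miss that.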