Government AI Chatbots Prone to 'Waffling'

A new study found that AI chatbots often hallucinate or "waffle" when required to give concise answers to government and public queries. Most major models also failed to refuse inappropriate requests. The findings underscore the challenge of creating safe, bounded conversational AI, especially for unpredictable questions from children.

- The study from the Open Data Institute (ODI), which created the "CitizenQuery-UK" benchmark, tested 11 large language models (LLMs) against a dataset of over 22,000 questions derived from the UK's official GOV.UK website.
- Among the models evaluated were those from the Llama, Claude, Gemini, Qwen, and OpenAI families, including specific versions like Anthropic's Claude-4.5-Haiku and Meta's Llama 3.1 8B.
- A primary issue identified was verbosity; models often produce "word salad" responses that bury factual information, with Anthropic's Claude 4.5 Haiku noted as being particularly verbose.
- When instructed to be more concise, the accuracy of the LLMs tended to decrease, and they were more likely to introduce information from outside of the authoritative government sources.
- The study found that models rarely refused to answer a query, a characteristic the researchers labeled as a "dangerous trait" because it increases the likelihood of users acting on incorrect information.
- A specific example of a factual error was from Llama 3.1 8B, which incorrectly stated that a court order was necessary to add an ex-partner's name to a child's birth certificate.
- Another significant error came from ChatGPT-OSS-20B, which provided the dangerously incorrect advice that a person is only eligible for Guardian's Allowance (a benefit for those caring for a child whose parents have died) if the child in their care has also died.
- The research also pointed out that smaller, more cost-efficient LLMs often delivered comparable results to the larger, closed-source models like OpenAI's ChatGPT 4.1.

Shared from Scout - Be the smartest in the room.