Grok 4.20’s Truth Claim

Elon Musk highlighted that Grok 4.20 now admits “I don’t know” far more often, claiming an 83% non-hallucination rate as a competitive advantage in truthfulness (x.com). The update signals growing vendor focus on refusal behavior, where a system declines to answer rather than fabricate one, and that changes how enterprises will assess model reliability for critical workflows (x.com).

Elon Musk is selling a strange brag for an artificial intelligence model: Grok 4.20 says “I don’t know” more often. He tied that behavior to an “83% non-hallucination rate” in a post on X, turning refusal into a product feature instead of an embarrassment. (x.com)

A hallucination is when a chatbot fills in a blank with a confident guess, like a student inventing a citation instead of leaving the line empty. The benchmark Musk is pointing to rewards the opposite habit by giving no penalty when a model declines to answer. (artificialanalysis.ai)

That benchmark is called Artificial Analysis Omniscience, and it uses 6,000 questions across 42 topics in six domains, including health, law, business, and software engineering. Its core score adds one point for a correct answer, subtracts one point for a wrong answer, and gives zero points for abstaining. (artificialanalysis.ai)

The key trick is that the test separates raw knowledge from guessing behavior. Artificial Analysis says high-knowledge models can still score worse if they answer too aggressively instead of admitting uncertainty. (artificialanalysis.ai)

On the live Omniscience leaderboard, Grok 4.20 0309 v2 has a 17% hallucination rate, which implies an 83% non-hallucination rate (100% minus 17%). The same page shows Grok 4.20 0309 also near the top on low hallucination, with a 22% hallucination rate. (artificialanalysis.ai)

xAI is leaning hard into that claim in its own product pages. The company describes Grok 4.20 as its “newest flagship model,” says it has a 2,000,000-token context window, and says it combines “the lowest hallucination rate on the market” with strict prompt adherence. (docs.x.ai)

The company is also pricing that reliability push like a mainstream application programming interface product, not a research demo. xAI’s developer docs list Grok 4.20 at $2.00 per million input tokens and $6.00 per million output tokens. (docs.x.ai)

This is a different sales pitch from the one artificial intelligence companies used in 2023 and 2024, when the loudest claim was usually “best benchmark score.” Artificial Analysis currently shows Grok 4.20 at 48 on its broader Intelligence Index, behind leaders at 57, even while Grok stands out on the hallucination metric. (artificialanalysis.ai)

That gap tells you what xAI is optimizing for. A model can be less impressive on broad reasoning tests and still be more attractive for customer support, compliance drafts, or internal search if it is less likely to invent a policy, date, or document. (artificialanalysis.ai)

The enterprise consequence is simple: buyers now have to ask two separate questions instead of one. The first is “How smart is this model?” and the second is “What does it do when it reaches the edge of what it knows?” (artificialanalysis.ai)

That second question is getting easier to measure because refusal behavior is now visible in public benchmarks and in vendor marketing. When Musk highlights “I don’t know” as a win, he is signaling that the next competition is not just who answers the most questions, but who refuses the dangerous ones. (x.com)
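
For readers who want to see why the Omniscience scoring rule favors a model that says “I don’t know,” here is a minimal Python sketch. It applies the plus one, minus one, zero rule described above to two made-up models that know exactly the same amount. The labels, function names, and sample data are hypothetical and do not come from Artificial Analysis. The hallucination-rate formula is also an assumption (wrong answers as a share of all responses that were not correct), since the article does not define it.

```python
from collections import Counter

# Hypothetical labels for graded responses; not names used by the benchmark.
CORRECT, WRONG, ABSTAIN = "correct", "wrong", "abstain"

# Scoring rule as described in the article: +1 correct, -1 wrong, 0 abstain.
POINTS = {CORRECT: 1, WRONG: -1, ABSTAIN: 0}


def raw_score(outcomes):
    """Sum the points over a list of graded responses."""
    return sum(POINTS[o] for o in outcomes)


def hallucination_rate(outcomes):
    """ASSUMPTION: wrong answers divided by all non-correct responses."""
    counts = Counter(outcomes)
    not_correct = counts[WRONG] + counts[ABSTAIN]
    return counts[WRONG] / not_correct if not_correct else 0.0


# Two made-up models with identical knowledge: 50 correct out of 100.
# The guesser answers everything; the cautious model abstains on 40.
guesser = [CORRECT] * 50 + [WRONG] * 50
cautious = [CORRECT] * 50 + [WRONG] * 10 + [ABSTAIN] * 40

for name, outcomes in (("guesser", guesser), ("cautious", cautious)):
    print(name, raw_score(outcomes), f"{hallucination_rate(outcomes):.0%}")
# prints:
# guesser 0 100%
# cautious 40 20%
```

Both toy models get the same 50 questions right, yet the cautious one scores 40 points against the guesser’s 0, because each wrong guess cancels a correct answer while an abstention costs nothing.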
