ICLR paper goes viral

A viral social post highlighted an ICLR 2025 paper showing some LLMs fail simple reasoning when irrelevant clauses are added — suggesting the models pattern‑match rather than truly reason. (x.com) The shared example — models subtracting counts for 'smaller kiwis' despite irrelevance — is being used to caution against over‑trusting unverified outputs. (x.com)

A large language model can look smart on a math word problem and still break the moment you add one useless detail. That is the idea behind an International Conference on Learning Representations 2025 paper that has started spreading far beyond research circles after a viral social post highlighted one especially simple example. (arxiv.org) (x.com)

The paper is called “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models,” and it was published as a conference paper at the International Conference on Learning Representations 2025. The authors are Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. (arxiv.org) (openreview.net)

The setup is easy to understand. Researchers take grade-school-style math questions and generate many controlled variations, so they can test whether a model is actually following the logic or just recognizing familiar wording patterns. (arxiv.org) That matters because ordinary benchmark scores can hide a lot. A model can score well on a standard test set while still being brittle when the same underlying problem is rewritten with different numbers or extra sentences. (arxiv.org)

The paper says the models showed “noticeable variance” even when only the numerical values changed. In plain English, that means two questions with the same structure could get very different answers just because the numbers were swapped. (arxiv.org)

The more striking result came when the researchers added clauses that sounded relevant but did not actually matter to the solution. Performance dropped sharply as those extra clauses piled up. (arxiv.org) One result in the paper says that adding a single clause that appears relevant can cause performance drops of up to 65% across the state-of-the-art models they tested. The clause changes the story, but it does not change the math required to get the right answer. (arxiv.org)

The kiwi example is the one now circulating online. A problem can mention that some kiwis were “smaller than average,” even though the question asks only about how many kiwis were picked, and some models still treat the size detail like it should be subtracted from the count. (techcrunch.com) (x.com) That mistake is revealing because it looks less like careful reasoning and more like autocomplete with confidence. The model sees a phrase that resembles details often used in school-word-problem deductions, then follows that pattern even when the detail is irrelevant. (arxiv.org)

The authors say that current large language models may not be doing “genuine logical reasoning” and may instead be replicating reasoning steps seen in training data. They also cite prior work suggesting the process is closer to probabilistic pattern matching than formal reasoning. (arxiv.org)

That does not mean large language models are useless at math or reasoning. It means a clean-looking answer is not proof that the model understood the problem in the same way a careful human would. (arxiv.org) It also helps explain a familiar user experience. A chatbot can solve five hard-looking questions in a row, then fail on a sixth question that a middle-school student would get right after crossing out one irrelevant sentence. (arxiv.org)

The viral post is landing now because it turns an abstract research argument into a one-line warning. If a model can be distracted by “smaller kiwis,” then polished prose and confident formatting are not enough reason to trust an answer without checking it. (x.com) (arxiv.org)
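
For readers who want to see the testing idea in miniature, here is a rough Python sketch of how controlled variations of one word problem might be generated, with and without an irrelevant clause. It is an illustration only, not the paper's code or data: the template wording, the names, the number ranges, and the helpers `make_variant`, `accuracy`, and `naive_solver` are all assumptions made up for this example. `naive_solver` is a deliberately flawed stand-in for a model that subtracts any extra number it sees.

```python
import random
import re

# Hypothetical template: wording and number ranges are illustrative,
# not taken from the paper's dataset.
TEMPLATE = (
    "{name} picks {a} kiwis on Friday and {b} kiwis on Saturday. "
    "{distractor}How many kiwis did {name} pick in total?"
)
DISTRACTOR = "{k} of the kiwis were smaller than average. "


def make_variant(rng, with_distractor):
    """Return (question_text, correct_answer) for one sampled variant."""
    name = rng.choice(["Oliver", "Maya", "Sam"])
    a = rng.randint(10, 60)
    b = rng.randint(10, 60)
    k = rng.randint(1, 9)
    clause = DISTRACTOR.format(k=k) if with_distractor else ""
    question = TEMPLATE.format(name=name, a=a, b=b, distractor=clause)
    # The size detail never enters the arithmetic: the answer is always a + b.
    return question, a + b


def accuracy(solver, n=100, with_distractor=False, seed=0):
    """Fraction of n sampled variants that the solver answers correctly."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        question, answer = make_variant(rng, with_distractor)
        if solver(question) == answer:
            correct += 1
    return correct / n


def naive_solver(question):
    """Toy pattern-matcher: add the first two numbers, subtract any others."""
    nums = [int(x) for x in re.findall(r"\d+", question)]
    return nums[0] + nums[1] - sum(nums[2:])


if __name__ == "__main__":
    print(accuracy(naive_solver, with_distractor=False))
    print(accuracy(naive_solver, with_distractor=True))
```

In this toy setup the naive solver gets every plain variant right and every variant with the distractor wrong, because the correct answer never depends on the size clause. A real evaluation would replace `naive_solver` with calls to an actual model and would use far more varied templates.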
