Agent evals look brittle

Researchers and practitioners have found that major agent benchmarks can be gamed or broken through test leakage, brittle string matching, and insecure eval code, which calls basic score validity into question. At the same time, people are warning against using LLMs as judges without careful prompts and rationale-first designs, because judge bias and a preference for verbosity distort results. ACL work even shows labelers are biased by an ‘AI vs human’ flag. Taken together, the two threads imply that benchmark scores alone are not reliable proof of agent capability. (techplanet.today) (x.com) (x.com)

Agent benchmarks are supposed to measure whether an artificial intelligence system can finish real tasks. New reporting and recent papers show many of those scores can be inflated by weak test design and biased judging. (techplanet.today) (aclanthology.org)

An agent benchmark usually gives a model a task, tools, and a scoring script, then checks whether the final answer matches an expected result. The problem is that those checks can be brittle: a benchmark can fail a correct answer over formatting, or pass a bad run if the eval code is sloppy. (techplanet.today)

A recent post circulated by researchers at the University of California, Berkeley’s Responsible Design Institute said they pushed top scores on major agent benchmarks by exploiting structural weaknesses rather than building better agents. The examples described three recurring failure modes: test leakage, fragile string matching, and insecure evaluation code. (techplanet.today) (aiproductivity.ai)

Test leakage means the answers, or clues close to the answers, show up in training data or benchmark artifacts, so a model can look smart by memorizing. Fragile string matching means the grader rewards exact wording instead of actual task success. (techplanet.today)

That leaves a second problem: many teams now replace human reviewers with a large language model judge, which is cheaper and faster but not automatically neutral. An Empirical Methods in Natural Language Processing 2024 paper found both humans and large language model judges were vulnerable to multiple biases and prompt-based attacks. (aclanthology.org)

Those biases have names, but the plain-English version is simple. Judges can prefer the first option they see, reward longer answers for sounding more complete, or be nudged by labels and framing instead of underlying quality. (aclanthology.org 1) (aclanthology.org 2)

An Association for Computational Linguistics 2025 paper called JUDGE-BENCH tested 11 current models across 20 natural language processing datasets with human annotations. The authors reported “substantial variance across models and datasets” and said large language models should be validated against human judgments before being used as evaluators. (aclanthology.org)

Another 2025 Association for Computational Linguistics paper found human raters could not reliably tell human and artificial intelligence text apart in blind tests, yet preferred text labeled “Human Generated” over identical text labeled “AI Generated” by more than 30%, even when labels were swapped. That means a simple source tag can move a score before anyone reads for substance. (aclanthology.org) (arxiv.org)

Researchers are also finding that adding more model judges does not automatically fix the problem. A Findings of the Association for Computational Linguistics 2025 paper reported that debate-style multi-agent judging amplified position, verbosity, chain-of-thought, and bandwagon biases after the first round, while meta-judge setups were more resistant. (aclanthology.org)

The practical takeaway is narrower than “benchmarks are useless,” but harsher than “just use a better prompt.” A benchmark score can still be informative, but only if the task set is clean, the grader is robust, and the judging method has been checked against known bias and leakage failure modes. (aclanthology.org 1) (aclanthology.org 2) (techplanet.today)

That is why recent benchmark numbers are getting a closer look. If the test can be gamed and the judge can be steered, the headline score says less about agent capability than it appears to say. (techplanet.today) (aclanthology.org)
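
For readers who want to see two of these failure modes in miniature, here is a rough Python sketch. It is not taken from any of the benchmarks or papers cited above. The function names (`exact_match`, `normalized_match`, `position_consistent`) and the toy judges are hypothetical, invented only for illustration. The first half contrasts a brittle exact-string grader with a slightly more forgiving one. The second half shows one simple way to probe a judge for position bias: ask it the same comparison in both orders and see whether the winner changes.

```python
from typing import Callable

def exact_match(answer: str, expected: str) -> bool:
    """Brittle grader: passes only on character-for-character equality."""
    return answer == expected

def normalized_match(answer: str, expected: str) -> bool:
    """More forgiving grader (illustrative only): ignores case,
    extra whitespace, and leading or trailing punctuation."""
    def norm(text: str) -> str:
        collapsed = " ".join(text.lower().split())
        return collapsed.strip(".,;:!?\"'")
    return norm(answer) == norm(expected)

# A judge takes the answer shown first and the answer shown second,
# and returns "first" or "second" to name its preferred slot.
Judge = Callable[[str, str], str]

def position_consistent(judge: Judge, a: str, b: str) -> bool:
    """True if the judge prefers the same underlying answer
    regardless of which one is shown first."""
    forward = judge(a, b)    # a is shown first
    backward = judge(b, a)   # b is shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    return winner_forward == winner_backward

def always_first(first: str, second: str) -> str:
    """Toy judge with extreme position bias: always picks slot one."""
    return "first"

def longer_wins(first: str, second: str) -> str:
    """Toy judge with verbosity bias: always picks the longer answer."""
    return "first" if len(first) >= len(second) else "second"

if __name__ == "__main__":
    print(exact_match("Paris.", "paris"))       # formatting alone fails it
    print(normalized_match("Paris.", "paris"))  # same answer now passes
    short, long_ = "Paris", "The capital is Paris, a city in France."
    print(position_consistent(always_first, short, long_))
    print(position_consistent(longer_wins, short, long_))
```

Note that the verbosity-biased toy judge passes the order-swap check, because it picks the longer answer in either order. A swap test of this kind only targets position bias, so length and label effects would need their own checks, ideally validated against human judgments as the JUDGE-BENCH authors recommend.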
