Study finds LLM leaderboards statistically fragile
A new study warns that popular LLM ranking platforms and leaderboards are statistically fragile. Researchers found that model rankings are highly sensitive to outlier ratings and user errors, potentially misleading both engineers and enterprise buyers. The study advises platform operators to implement outlier screening and increase transparency around model comparisons.
- The study by researchers from MIT and IBM Research found that removing just two user ratings out of 57,477 on the Chatbot Arena leaderboard was enough to change the top-ranked model from GPT-4-0125-preview to GPT-4-1106-preview.
- This fragility was observed across multiple platforms; flipping the top spot on Vision Arena required removing 0.094% of ratings, and Search Arena required 0.253%, according to senior author Tamara Broderick and her team.
- In contrast, the MT-bench benchmark proved more robust, requiring the removal of 2.74% of its evaluations to alter the top rank. The researchers attribute this stability to its design, which uses 80 standardized multi-turn questions and expert annotators instead of crowdsourced preferences.
- The two user ratings that flipped the Chatbot Arena ranking were outlier matchups where the top model, GPT-4-0125-preview, lost to significantly lower-ranked models: Vicuna-13b (rank 43) and Stripedhyena-nous-7b (rank 45).
- Researchers noted the issue is not unique to AI and is seen in sports rankings; in historical NBA data, removing just 0.016% of games was enough to change the top-ranked team. The study's authors suggest mitigations like allowing users to specify a confidence level with their vote or having mediators review influential ratings.
- This statistical fragility compounds other known issues with crowdsourced leaderboards, such as models being optimized for preferred style (e.g., longer, emoji-filled responses) rather than factual accuracy, a tactic that has been used to boost rankings.
- Other analyses have revealed systemic flaws in platforms like Chatbot Arena, including preferential data access for proprietary model providers and the undisclosed deprecation of open-weight models, creating asymmetries in the data used for training and evaluation.
- The study found no significant difference in fragility between leaderboards using human voters and those using an "LLM-as-a-judge" approach, indicating that simply automating the evaluation does not solve the underlying statistical instability.
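
For readers who want to see how a single rating can move a top spot, here is a minimal sketch on invented data. It is not the researchers' method, which the summary above does not describe. It assumes the leaderboard scores models with a Bradley-Terry model fitted to pairwise preferences, a common choice for this kind of ranking, and it brute-forces the check by refitting after dropping each rating in turn. The model names, the vote counts, and the helper functions are all hypothetical.

```python
from collections import Counter


def fit_bradley_terry(matches, iters=1000):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the standard minorization-maximization update. Assumes every
    model has at least one win, which holds for the toy data below.
    """
    models = sorted({m for pair in matches for m in pair})
    wins = Counter(winner for winner, _ in matches)
    games = Counter(frozenset(pair) for pair in matches)
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for m in models:
            denom = sum(
                games[frozenset((m, other))] / (strength[m] + strength[other])
                for other in models
                if other != m
            )
            updated[m] = wins[m] / denom
        # Rescale so strengths stay comparable between iterations.
        scale = len(models) / sum(updated.values())
        strength = {m: s * scale for m, s in updated.items()}
    return strength


def top_model(matches):
    """Name of the model with the highest fitted strength."""
    strength = fit_bradley_terry(matches)
    return max(strength, key=strength.get)


def single_drops_that_flip(matches):
    """Indices of ratings whose removal alone changes the top-ranked model."""
    baseline = top_model(matches)
    return [
        i
        for i in range(len(matches))
        if top_model(matches[:i] + matches[i + 1:]) != baseline
    ]


# Hypothetical votes: two evenly matched strong models and one weak model.
# model_b has two upset losses to model_c; they are the last two entries.
matches = (
    [("model_a", "model_b")] * 10 + [("model_b", "model_a")] * 10
    + [("model_a", "model_c")] * 9 + [("model_c", "model_a")] * 1
    + [("model_b", "model_c")] * 12 + [("model_c", "model_b")] * 2
)

flips = single_drops_that_flip(matches)
print("top model on all ratings:", top_model(matches))
print("single ratings that flip the top spot:", len(flips), "of", len(matches))
```

In this toy set the two leading models split their head-to-head votes evenly, so their order is decided by a handful of votes against the weak model, and dropping one upset is enough to swap them. Refitting once per rating is only practical for small datasets; at the scale of tens of thousands of ratings, a faster approximation would be needed.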