Scale AI Launches Human-Evaluated Leaderboard

Published by The Daily Scout

What happened

Scale AI has launched the SEAL Showdown, a new leaderboard that ranks LLMs based on human evaluations of performance on real-world prompts. The move represents a shift in the industry away from synthetic benchmarks toward more user-centric and outcome-based model evaluation.

Why it matters

  • The SEAL Showdown leaderboard is based on millions of conversations from Scale's global network, encompassing users from over 100 countries, 70 languages, and 200 professions.
  • It allows for demographic segmentation, enabling users to filter model rankings by country, age, education level, language, and profession to see how models perform for specific user groups.
  • The ranking methodology uses the Bradley-Terry model to determine scores, which is augmented with style controls to account for confounding factors like response length, use of Markdown, and loading times.
  • To prevent models from being trained on the evaluation data, Scale AI does not sell or license data from the same distribution as the live leaderboard for a period of 60 days.
  • This initiative is a direct competitor to other evaluation platforms like LMArena, with Scale AI arguing that existing benchmarks often rely too heavily on hobbyist participation and narrow user groups, which can skew results.
  • In addition to the public Showdown, Scale also operates SEAL (Safety, Evaluations, and Alignment Lab) Leaderboards that use private, curated datasets to rank models on specific capabilities like coding, math, and instruction following.
  • Initial domain-specific leaderboards from SEAL showed OpenAI's GPT models ranking first in coding, instruction following, and multilingual capabilities, while Anthropic's Claude 3 Opus ranked highest in reasoning.
  • The system aims to combat issues like benchmark overfitting and contamination, where models are trained specifically to perform well on known tests but fail in real-world applications.
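
For readers unfamiliar with the Bradley-Terry model mentioned above, the following is a minimal sketch of how pairwise human preferences can be turned into a ranking. It is not Scale AI's implementation: the function names, the input format (a list of winner/loser pairs), and the iterative fitting routine are all assumptions made for illustration, and the style controls for response length, Markdown, and loading times are not modeled here.

```python
from collections import defaultdict


def fit_bradley_terry(battles, iterations=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Hypothetical helper for illustration only. Uses the classic
    minorization-maximization update and returns a dict mapping each
    model name to a strength, normalized so the strengths sum to 1.
    Assumes every battle is between two different models.
    """
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # number of battles per unordered pair
    models = set()
    for winner, loser in battles:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    # Start from equal strengths.
    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            denom = 0.0
            for j in models:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom else strength[i]
        total = sum(updated.values())
        updated = {m: s / total for m, s in updated.items()}
        delta = max(abs(updated[m] - strength[m]) for m in models)
        strength = updated
        if delta < tol:
            break
    return strength


def win_probability(strength, a, b):
    """Modeled probability that model a is preferred over model b."""
    return strength[a] / (strength[a] + strength[b])


if __name__ == "__main__":
    # Made-up votes between placeholder model names, not real leaderboard data.
    votes = [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
    scores = fit_bradley_terry(votes)
    print(sorted(scores.items(), key=lambda kv: -kv[1]))
    print(win_probability(scores, "model_a", "model_b"))
```

The key idea is that each model gets a single strength value, and the chance that one model is preferred over another depends only on the ratio of their strengths. A production leaderboard would typically add covariates for style effects and report confidence intervals, neither of which this sketch attempts.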

Key numbers

  • The SEAL Showdown leaderboard is based on millions of conversations from Scale's global network, encompassing users from over 100 countries, 70 languages, and 200 professions.
  • To prevent models from being trained on the evaluation data, Scale AI does not sell or license data from the same distribution as the live leaderboard for a period of 60 days.
  • Initial domain-specific leaderboards from SEAL showed OpenAI's GPT models ranking first in coding, instruction following, and multilingual capabilities, while Anthropic's Claude 3 Opus ranked highest in reasoning.

What happens next

  • The system aims to combat issues like benchmark overfitting and contamination, where models are trained specifically to perform well on known tests but fail in real-world applications.

