AI Models Surpass Human Benchmarks

State-of-the-art AI models are increasingly matching or surpassing human performance on complex tasks, according to Stanford's 2025 AI Index report. The report highlights new benchmarks like MMMU for multimodal reasoning and GPQA for graduate-level questions. The results show intensifying global competition in AI development between the U.S., China, and Europe.

The Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark tests AI on college-level problems across 30 subjects, from art to engineering. It uses a mix of text and 30 types of complex images like charts, diagrams, and even music sheets to evaluate a model's ability to perceive, know, and reason at an expert level. Models are also being tested against the Graduate-Level Google-Proof QA (GPQA) benchmark, a set of questions in biology, physics, and chemistry so difficult they cannot be answered by simply searching online. On this benchmark, even skilled non-experts with internet access only achieve 34% accuracy, while domain experts reach 65%.

The cost of developing these frontier AI models has escalated dramatically. Google's Gemini 1.0 Ultra is estimated to have cost $192 million to train, while Meta's Llama 3.1-405B cost around $170 million. Anthropic's CEO, Dario Amodei, noted that while some models cost $100 million, others in training are approaching a $1 billion price tag.

In 2024, the United States produced 40 notable AI models, significantly more than China's 15 and Europe's 3. However, the performance gap is shrinking, with Chinese models reaching near-parity on key benchmarks like MMLU and HumanEval. Total corporate investment in AI reached $252.3 billion in 2024, with private investment in the U.S. hitting $109.1 billion. This is nearly 12 times the private AI investment seen in China ($9.3 billion) and 24 times that of the U.K. ($4.5 billion).

Despite the rapid advances, there is still significant room for improvement. Top models like Google's Gemini Ultra and GPT-4V have only achieved accuracies of 59% and 56%, respectively, on the challenging MMMU benchmark. The competition at the cutting edge is intensifying, with the performance difference between the top-ranked and 10th-ranked models narrowing from 11.9% to 5.4% in just a year. The top two models are now separated by a slim 0.7% margin.

This progress is becoming more accessible as costs for AI inference are dropping. The cost for a system at the level of GPT-3.5 fell by a factor of over 280 between late 2022 and late 2024. Additionally, open-weight models are increasingly closing the performance gap with their closed-source counterparts.
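
For readers who want to check the investment comparison quoted above, the "nearly 12 times" and "24 times" figures can be reproduced from the dollar amounts in the article. The following is a minimal sketch, not part of the report itself; the dictionary and function names are illustrative assumptions, and the only data used are the three private-investment figures cited here.

```python
# Private AI investment in 2024, in billions of US dollars, as quoted in the article.
# The variable and function names are hypothetical, chosen only for this illustration.
private_investment = {"US": 109.1, "China": 9.3, "UK": 4.5}


def ratio_to(base: str, other: str) -> float:
    """Return how many times larger `base`'s investment is than `other`'s."""
    return private_investment[base] / private_investment[other]


us_vs_china = ratio_to("US", "China")
us_vs_uk = ratio_to("US", "UK")

print(f"US vs China: {us_vs_china:.1f}x")  # roughly 11.7x, i.e. "nearly 12 times"
print(f"US vs UK: {us_vs_uk:.1f}x")        # roughly 24.2x, i.e. about "24 times"
```

The ratios come out slightly under 12 and slightly over 24, which is consistent with the rounded wording in the article.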
