Public Leaderboard Benchmarks AI Agent Performance
The MCPMark leaderboard now provides public, comparative benchmarking for AI agent server implementations. The leaderboard tracks metrics such as pass rates, token usage, and cost per run, offering transparency that helps developers evaluate the real-world performance and cost-effectiveness of different agent platforms.
- The MCPMark benchmark was collaboratively developed by EVAL SYS, LobeHub, and the National University of Singapore (NUS) to stress-test AI agents in realistic environments.
- Unlike benchmarks that use simulated environments, MCPMark evaluates agents on 127 real-world tasks involving tools like PostgreSQL, GitHub, Notion, and the Playwright browser automation framework.
- Tasks are designed to test a full range of operations beyond simple data retrieval, including creating a CI workflow, updating a database, or deleting project files.
- The benchmark highlights the current limitations of even top-tier models; the best-performing model, gpt-5-medium, only achieved a 52.56% pass rate, while other prominent models fell below 30%.
- Cost-effectiveness is a key metric, with the leaderboard revealing significant differences; for instance, the Qwen-3-Coder model delivered a success rate comparable to a leading competitor at roughly one-third of the per-run cost.
- This public benchmarking arrives as the cost of running AI agents becomes a critical factor for businesses, with typical monthly expenses ranging from $1,000 to over $5,000, driven by API calls and token usage.
- MCPMark is part of a broader ecosystem of specialized AI agent evaluations, which includes SWE-bench for resolving GitHub issues and WebArena for browser-based task completion.
- The project is open-source (Apache-2.0 license) and uses a human-AI collaborative pipeline to design and refine its challenging tasks, each of which includes an automated script for objective verification.
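To make the leaderboard metrics concrete, the following minimal Python sketch shows one way pass rate, token usage, and cost per run could be aggregated from per-task results. It is an illustration only and not MCPMark's actual code: the `RunResult` record, its field names, the `summarize` function, and the sample figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RunResult:
    """One agent run on one task (hypothetical record layout)."""
    task_id: str
    passed: bool        # outcome of the task's automated verification script
    total_tokens: int   # input plus output tokens consumed by the run
    cost_usd: float     # API cost attributed to the run


def summarize(results: Iterable[RunResult]) -> dict[str, Optional[float]]:
    """Aggregate per-run results into leaderboard-style metrics."""
    runs = list(results)
    if not runs:
        raise ValueError("no runs to summarize")

    passed = sum(1 for r in runs if r.passed)
    total_cost = sum(r.cost_usd for r in runs)

    return {
        "pass_rate_pct": 100.0 * passed / len(runs),
        "avg_tokens_per_run": sum(r.total_tokens for r in runs) / len(runs),
        "cost_per_run_usd": total_cost / len(runs),
        # Cost per passed task is undefined when nothing passed.
        "cost_per_pass_usd": total_cost / passed if passed else None,
    }


if __name__ == "__main__":
    # Made-up sample data, not real leaderboard results.
    sample = [
        RunResult("github/create-ci-workflow", True, 1000, 0.5),
        RunResult("postgres/update-table", False, 3000, 1.5),
    ]
    for name, value in summarize(sample).items():
        print(f"{name}: {value}")
```

Dividing total cost by the number of passed tasks, rather than by all runs, is one way to compare cost-effectiveness between models that have similar pass rates but different per-run costs.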