Browser agent benchmark claim

Browser Use claims it beat OpenAI, Google and Anthropic on a browser-agent benchmark using a self-fixing LLM+CLI approach. The claim, posted publicly this week, is drawing attention in developer circles as platforms race to own agent workflows. (x.com)

Browser Use published an open-source benchmark post on January 31, 2026 that it says is designed to compare LLMs on web-automation tasks. (browser-use.com) The writeup and linked materials describe a curated suite of roughly 100 difficult web tasks drawn from existing sets such as WebBench and GAIA for head-to-head comparison. (nontrivial.ai) Browser Use's public GitHub benchmark repo was updated this week to add a "bu-ultra" result and an updated plot showing a 78% score for that configuration. (github.com) The repo contains automation pieces (judge.py and an orchestrator) that run and grade tasks, and the project's docs describe a persistent CLI daemon plus multi-LLM interfaces used to run retries and scripted tool invocations. (github.com)

Discussion and coverage have surfaced on forums and in tech outlets: a Hacker News thread, along with hands-on pieces on TechRadar and independent blogs, has been debating the methodology and the claims. (news.ycombinator.com)

Independent leaderboards tell a different story. WebBench's public leaderboard still lists Anthropic's Sonnet CUA near the top (Sonnet 3.7 at 66.0%) and shows Browser Use Cloud at about 43.9%. That gap underscores that Browser Use's in-repo 78% reflects its own task set and evaluation pipeline rather than a single community standard. (webbench.ai)
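For readers unfamiliar with how a run-and-grade pipeline of this kind fits together, here is a minimal Python sketch of an orchestrator that runs each task, retries on errors, and hands the result to a judge. It is an illustration only, not Browser Use's code: the names Task, run_benchmark, agent and judge are hypothetical, and retrying only on exceptions is an assumption, since the source does not say how judge.py or the orchestrator actually behave.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Task:
    """Hypothetical task record: an id plus the instruction given to the agent."""
    task_id: str
    prompt: str


def run_benchmark(
    tasks: Sequence[Task],
    agent: Callable[[Task, int], str],   # hypothetical: (task, attempt) -> output
    judge: Callable[[Task, str], bool],  # hypothetical: (task, output) -> pass/fail
    max_attempts: int = 3,
) -> float:
    """Run every task, retrying on errors, and return the fraction judged as passed."""
    if not tasks:
        return 0.0

    passed = 0
    for task in tasks:
        output: Optional[str] = None
        for attempt in range(max_attempts):
            try:
                output = agent(task, attempt)
                break  # the agent produced an output; stop retrying
            except Exception:
                continue  # assumed policy: retry only when the run itself errors

        # A task with no output after all attempts counts as a failure.
        if output is not None and judge(task, output):
            passed += 1

    return passed / len(tasks)
```

A score produced this way depends on the task list, the retry policy and the judge together, which is why a figure from one pipeline is not directly comparable to a figure from another leaderboard.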
