Microsoft Fara1.5 tops web benchmarks

- Microsoft Research said on May 21 that its open-weight Fara1.5 browser-agent family outperformed rival systems on live-web browsing benchmarks. - Microsoft reported its 27B-parameter Fara1.5 model scored 72% on Online-Mind2Web, ahead of OpenAI Operator at 58.3% and Google Gemini 2.5 at 57.3%. - Microsoft said Fara1.5 models are available in 4B, 9B and 27B sizes, with deployment details posted on Microsoft Research and GitHub.

Microsoft Research said on May 21 that its new Fara1.5 browser-agent models outperformed OpenAI’s Operator and Google’s Gemini 2.5 Computer Use on a live-web benchmark that measures whether AI systems can complete tasks on real websites. The company said the top Fara1.5 model, a 27 billion-parameter system, scored 72% on Online-Mind2Web, a benchmark built around 300 tasks across 136 websites. Microsoft’s published comparison put OpenAI’s Operator at 58.3% and Google’s Gemini 2.5 Computer Use at 57.3%. The release adds to a growing contest over “computer use” agents, which are designed to click, type and navigate websites rather than answer prompts in a chat box. ### What exactly did Microsoft release on May 21? Microsoft Research said Fara1.5 is a family of three open-weight browser agents: Fara1.5-4B, Fara1.5-9B and Fara1.5-27B. The company said the models are designed for browser-based tasks such as comparing products, filling out forms and booking events, and that they build on its earlier Fara-7B work. (microsoft.com) The Microsoft post said the 9B model reached 63% on Online-Mind2Web, while the 4B version reached 57%. Microsoft described the family as practical to deploy on modest hardware and said the models were trained to ask for approval or clarification when needed. ### What is Online-Mind2Web measuring? Online-Mind2Web is a benchmark introduced in an academic paper that evaluates web agents on 300 tasks spanning 136 live websites. (microsoft.com) The paper said the goal was to test agents in a setting closer to how people use the web, after researchers found earlier benchmark results painted an overly optimistic picture of current systems. The benchmark’s creators said they also built an automated judging method that reached about 85% agreement with human judgment. (microsoft.com) That matters because web-agent evaluation is hard to keep stable when websites change, layouts move and tasks expire. ### How do OpenAI and Google describe their own browser agents? OpenAI introduced Operator in January 2025 as a research preview for Pro users in the United States, saying the system could use its own browser to type, click and scroll on the web. (arxiv.org) OpenAI said in a July 17, 2025 update that Operator had been integrated into ChatGPT as “agent mode,” and that the standalone Operator site would sunset. Google’s developer documentation says Gemini Computer Use lets developers build browser-control agents that act from screenshots by generating UI actions such as clicks and typed text. Google labels the capability a preview and says users should supervise important tasks and avoid using it for critical decisions, sensitive data or actions where serious errors cannot be corrected. (openai.com) ### Why are people focusing on the “open-weight” part? Microsoft said Fara1.5 is being released with open weights, unlike proprietary systems such as Operator and Gemini 2.5 Computer Use. In practice, that means outside developers can inspect, run and adapt the model weights more directly than they can with closed hosted products, though deployment terms still depend on the release channel and license. (ai.google.dev) Microsoft’s GitHub repository said on May 21 that a Fara1.5 agent harness was “coming soon,” and Microsoft’s product materials said the 9B model was available through Microsoft Foundry. Those details will matter for whether researchers can reproduce the benchmark claims and compare costs, latency and safety controls across systems. ### What should readers watch next? May 2026 benchmark claims in web agents are likely to be tested by outside researchers against the same Online-Mind2Web tasks and Microsoft’s own WebTailBench suite, which its repository says now includes a refreshed 609-task V2 split. (microsoft.com) Independent reruns will also show how much scores move as websites change and tasks are updated. (github.com)

Get your own daily briefing

Scout delivers personalized news, insights, and conversations tailored to your role and industry.

Download on the App Store

Shared from Scout - Be the smartest in the room.