MolmoWeb beats GPT-4o on multimodal benchmarks
MolmoWeb, an open multimodal web agent, claims state-of-the-art results among open-weight web agents and reportedly outperformed GPT-4o on benchmarks, a sign that browser automation and multimodal agents are advancing fast. That opens room for student projects that experiment with web-driven agents and automation. (x.com)
Ai2 published MolmoWeb on March 24, 2026, and released model weights, the MolmoWebMix dataset, inference code, and evaluation tools alongside a tech report. (allenai.org) MolmoWeb ships in 4B and 8B parameter sizes, both built on the Molmo 2 family, and operates by observing screenshots instead of parsing HTML or accessibility trees. (allenai.org) The MolmoWebMix training mixture combines more than 100K synthetic trajectories with 30K human demonstrations and accompanying GUI perception data used to train and evaluate the agents. (allenai.org)
On live web-navigation benchmarks, MolmoWeb-8B reportedly scored 78.2% on WebVoyager, 42.3% on DeepShop, and 49.5% on WebTailBench, making it the top open-weight web agent across the evaluated tests. (aihola.com) Test-time scaling (running multiple rollouts per task) pushed WebVoyager pass rates substantially higher: Byteiota reported pass@4 rising to 94.7% for the 8B model. (byteiota.com)
Training used supervised fine-tuning only (no reinforcement learning and no distillation from proprietary vision systems) on a cluster reported as 64 NVIDIA H100 GPUs, and the stack pairs Molmo language models with SigLIP2 vision encoders under the Molmo 2 architecture. (the-decoder.com) The public GitHub repo includes an Apache-2.0 LICENSE, scripts to download checkpoints (MolmoWeb-8B and MolmoWeb-4B are available on Hugging Face), and an inference server and evaluation harness for reproducing the results. (github.com) The release notes and demos indicate that Playwright is used for browser control and that a hosted demo is available but limited to whitelisted sites, while the repository documents backends (FastAPI/Modal) and example deployment scripts for local or cloud self-hosting. (aihola.com)
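For readers planning their own experiments, the pass@4 figure mentioned above can be made concrete with a small sketch. It assumes the common reading of pass@k, where a task counts as solved if at least one of k rollouts succeeds; the exact protocol MolmoWeb's evaluation used is not described here, and the function names and toy outcomes below are hypothetical illustrations, not part of the MolmoWeb release.

```python
from typing import Sequence


def pass_at_k(results: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of tasks where at least one of the first k rollouts succeeded.

    Assumption: results[i][j] is True if rollout j of task i succeeded.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    solved = 0
    for rollouts in results:
        if len(rollouts) < k:
            raise ValueError("every task needs at least k rollouts")
        if any(rollouts[:k]):
            solved += 1
    return solved / len(results)


def independent_pass_at_k(p: float, k: int) -> float:
    """pass@k if every rollout succeeded independently with probability p."""
    return 1.0 - (1.0 - p) ** k


# Hypothetical outcomes for 4 tasks with 4 rollouts each (not real data).
toy = [
    [True, True, False, True],
    [False, False, True, False],
    [False, False, False, False],
    [True, False, True, True],
]

print(pass_at_k(toy, 1))  # 0.5: tasks 1 and 4 succeed on the first rollout
print(pass_at_k(toy, 4))  # 0.75: task 2 is rescued by its third rollout
print(independent_pass_at_k(0.782, 4))
```

Under the independence assumption, a 78.2% single-rollout rate would imply a pass@4 of about 99.8%. The reported 94.7% is lower, which would be consistent with some tasks failing on every attempt, though the sources cited here do not break that down.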