Google study shows 17.2x error amplification in multi-agent systems
- Google Research and Google DeepMind researchers reported in a January 28 post that independent multi-agent systems can magnify mistakes instead of fixing them.
- In controlled tests, unchecked agent networks amplified errors 17.2 times, while centralized coordination cut that figure to 4.4 times.
- The paper says multi-agent gains depend on task type, with sequential work often getting worse. (research.google)
AI agents are software systems that reason, plan, and act across multiple steps, and Google researchers say chaining several together can sharply raise failure rates. (research.google) In a January 28, 2026 blog post, Yubin Kim and Xin Liu said their team tested five agent architectures and found that “more agents” often stopped helping. (research.google)

The underlying idea is simple: one agent makes a choice, another agent builds on it, and a small mistake can spread through the chain like a bad number copied across a spreadsheet. (research.google)

Their paper, “Towards a Science of Scaling Agent Systems,” evaluated 180 controlled configurations according to the blog summary, and 260 in the arXiv version of the paper, across six benchmarks and three language-model families. (research.google) (arxiv.org) The headline result was topology-dependent error amplification: independent agents working without a checking bottleneck amplified errors by 17.2 times, while centralized systems contained amplification to 4.4 times. (arxiv.org) (research.google)

The same study found coordination was not universally bad. Centralized coordination improved performance by 80.8% on parallelizable financial-reasoning tasks, while decentralized coordination did better on dynamic web-navigation tasks. (arxiv.org) But the paper said every multi-agent variant tested degraded performance on sequential reasoning tasks, with declines ranging from 39% to 70%. (arxiv.org)

The researchers also built a predictive model using coordination metrics such as efficiency, overhead, error amplification, and redundancy. They said it identified the best architecture for 87% of held-out configurations. (research.google) (arxiv.org)

That result lands in a broader debate over whether multi-agent systems reliably outperform a single strong model.
A separate 2025 paper from researchers at the University of California, Berkeley, and Stanford said gains on popular benchmarks were often minimal and cataloged 14 failure modes. (arxiv.org)

For companies building agent pipelines, the paper points to a design choice rather than a scaling rule: add coordination where tasks can be split safely, and add verification where one bad handoff can poison the next step. (research.google) (arxiv.org)
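
The handoff risk can be illustrated with simple arithmetic. The sketch below is not the paper's error-amplification metric. It assumes, purely for illustration, that each step in a chain is independently correct with a fixed probability and that a hypothetical verifier catches a fixed share of mistakes. The function names and the values 0.95 and 0.8 are made up for the example.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Chance that all n_steps handoffs are correct, assuming independent steps."""
    return p_step ** n_steps


def chain_success_with_check(p_step: float, n_steps: int, p_catch: float) -> float:
    """Same chain, but a verifier catches and repairs a fraction p_catch of bad steps.

    This is an illustrative assumption, not a model taken from the paper.
    """
    # A step ends up correct if it was right, or if it was wrong and got caught.
    p_effective = p_step + (1.0 - p_step) * p_catch
    return p_effective ** n_steps


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        plain = chain_success(0.95, n)
        checked = chain_success_with_check(0.95, n, p_catch=0.8)
        print(f"{n:>2} steps: unchecked {plain:.3f}, with verification {checked:.3f}")
```

Under these assumptions, a 5% per-step error rate leaves a five-step chain fully correct about 77% of the time, and a verifier that catches four in five mistakes raises that to about 95%. Real agent systems will not behave this neatly, since errors are rarely independent and verifiers can fail too.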