Microsoft finds agents fail long tasks
- Microsoft Research posted SocialReasoning-Bench on May 11, 2026, showing frontier AI agents can finish social tasks yet still fail to protect users’ interests.
- In Microsoft’s marketplace simulation, agents accepted the first offer up to 93% of the time, while DELEGATE-52 showed frontier models losing 25% of content.
- The gap matters because AI firms are selling agents for multistep office work, but Microsoft’s own tests say trust still breaks.

AI agents are getting pitched as digital coworkers: things that can manage your calendar, negotiate with vendors, and grind through long office workflows without much supervision. The problem is that “can do steps” is not the same as “acts in your interest.” Microsoft Research put fresh numbers on that gap this week, and the results are rough. In one benchmark, agents completed social tasks but still failed to reliably improve the user’s position. In another, long document workflows quietly degraded over time instead of staying intact.

### What did Microsoft actually release?

Microsoft Research published a blog post on May 11, 2026 introducing SocialReasoning-Bench, a benchmark for agents acting on a user’s behalf in two realistic settings: calendar coordination and marketplace negotiation. The point is simple: if an agent is emailing other people for you, it needs to know when to push, when to reveal information, and when to hold back. Microsoft says current frontier models usually finish the task, but often leave value on the table instead of advocating effectively for the user. (microsoft.com)

### What is the failure mode here?

The failure is not dramatic chaos. It is bland underperformance. The agent gets something done, but not the best thing for you. Microsoft measures both the final outcome and the quality of the process, including whether the agent shows due diligence. Even with explicit instructions to optimize for the user’s interest, performance stayed well below what Microsoft says a trustworthy delegate should achieve. (microsoft.com)

### Why is “social reasoning” the hard part?

Because these tasks are adversarial in a soft way. Another person has different goals, private information, and incentives to steer the exchange. A good assistant has to model both sides at once. Microsoft frames this as a principal-agent problem, the same basic structure you get with lawyers, brokers, or advisors acting for a client.
That is a much higher bar than filling a form or summarizing notes. (microsoft.com)

### How bad did the benchmark get?

One detail jumps out. In Microsoft’s earlier simulated marketplace setup, agents accepted the first proposal they received up to 93% of the time without exploring alternatives. That is basically the opposite of negotiation. It is like sending someone to buy a car and having them say yes to the first number on the windshield. (microsoft.com)

### Is this only about negotiation?

No, and that is the more important part. The Register tied the social benchmark to another Microsoft Research paper, “LLMs Corrupt Your Documents When You Delegate,” posted in April 2026. That work built DELEGATE-52, a benchmark spanning 52 professional domains and 19 models. The headline result: even frontier models such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 corrupted an average of 25% of document content by the end of long workflows. (microsoft.com) Across all models, average degradation was 50%.

### Do tools fix it?

Not much. The DELEGATE-52 paper says agentic tool use did not improve performance on that benchmark. The errors also got worse with bigger documents, longer interactions, and distractor files. So the usual answer (give the model more tools, more loops, more autonomy) does not solve the trust problem by itself.

### Are any domains actually ready?

Barely. In DELEGATE-52, Microsoft’s researchers set a readiness bar of 98% or better after 20 interactions. (theregister.com) The only domain that cleared it was Python programming. Everything else fell short, and 80% of simulated conditions showed severe document corruption of at least 20%. (arxiv.org)

### Why does this matter right now?

Because the market is moving the other way. Microsoft, Google, Anthropic, and OpenAI are all pushing agents deeper into work software and longer-running tasks. Microsoft’s own research is basically saying the wrapper has improved faster than the judgment.
The systems can look competent while quietly making weak decisions or introducing sparse, nasty errors that compound over time. (theregister.com)

### Bottom line?

Agents look useful for bounded tasks with tight review. But once the job turns into negotiation, delegation, or long-running knowledge work, trust is still the missing product. Microsoft’s new benchmarks matter because they stop asking whether agents can act, and ask whether they can act for you. (microsoft.com)
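
### What does the 98% bar look like per interaction?

To get a feel for how strict the DELEGATE-52 readiness bar is, it helps to turn the end-of-workflow numbers into a per-interaction rate. The sketch below is a back-of-the-envelope illustration, not the paper's method. It assumes content is lost at a constant rate in every interaction, and that the 25% and 50% degradation figures are also measured over 20 interactions, which the reporting does not state. The function name `per_step_retention` is hypothetical.

```python
def per_step_retention(final_fraction: float, steps: int) -> float:
    """Return the constant per-interaction retention rate r such that
    r ** steps == final_fraction.

    ASSUMPTION: degradation compounds at the same rate every interaction.
    This is an illustrative model, not how DELEGATE-52 scores documents.
    """
    if not 0.0 < final_fraction <= 1.0:
        raise ValueError("final_fraction must be in (0, 1]")
    if steps <= 0:
        raise ValueError("steps must be a positive integer")
    return final_fraction ** (1.0 / steps)


# ASSUMPTION: every figure is treated as "fraction intact after 20 interactions".
STEPS = 20

readiness_rate = per_step_retention(0.98, STEPS)   # the 98% readiness bar
frontier_rate = per_step_retention(0.75, STEPS)    # 25% of content corrupted
all_models_rate = per_step_retention(0.50, STEPS)  # 50% average degradation

for label, rate in [
    ("readiness bar", readiness_rate),
    ("frontier models", frontier_rate),
    ("all models", all_models_rate),
]:
    # Loss per interaction is whatever is not retained in that step.
    print(f"{label}: keep {rate:.4%} per interaction, lose {1 - rate:.4%}")
```

Under those assumptions, the readiness bar allows a loss of only about 0.1% of a document per interaction, while a 25% end-of-workflow loss works out to roughly 1.4% per interaction and a 50% loss to roughly 3.4%. Small per-step slippage is exactly the kind of error that is hard to spot in any single edit but adds up over a long workflow.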