Microsoft finds agents fail long tasks

- Microsoft Research introduced SocialReasoning-Bench on May 11, testing whether AI agents actually protect a user’s interests in scheduling and negotiation tasks.
- The sharpest detail is this: in earlier Microsoft simulations, agents accepted the first proposal up to 93% of the time.
- That matters because “agent” products are moving into email, calendars, and purchasing before they can reliably advocate for users.

AI agents are supposed to do more than finish chores. They’re supposed to act on your behalf. That is the whole sales pitch: not just “send the email,” but “represent me well.” Microsoft’s new SocialReasoning-Bench is a reality check on that promise, and the basic message is pretty blunt: current agents often complete the task, but they do a bad job protecting the user’s interests.

### What did Microsoft actually test?

Microsoft’s benchmark looks at two very ordinary situations: calendar coordination and marketplace negotiation. In both, the agent is acting for a “principal” (basically, the human it represents) while dealing with another party that has its own goals and private information. The benchmark scores two things separately: whether the task got done at all, and whether the agent got a good outcome for the user. (microsoft.com)

### Why is that different from normal agent evals?

Because most agent demos grade the easy part. Did the meeting get booked? Did a deal close? Those are completion metrics. But a human assistant who always books the meeting at your worst time, or always caves on price, is not a good assistant. Microsoft’s point is that “task completed” can hide a lot of failure. (microsoft.com)

### So how bad were the results?

Bad in a very specific way. The models usually finished the workflow, but outcome quality lagged badly. In marketplace negotiation, the tested models settled at or near zero on outcome optimality, meaning they gave away almost all the available surplus to the counterparty. In calendar scheduling, they did better, but still tended to accept slots that favored the requester more than the principal. (github.com)

### Which models were in the test?

The released experiment notes list GPT-4.1, GPT-5.4, Gemini 3 Flash, and Claude Sonnet 4.6 in the benchmark setup, with both unguided and more defensive system prompts. That matters because this is not a story about one weak model. Microsoft is describing a pattern across frontier systems. (github.com)

### Didn’t prompting fix it?

Not really. Stronger instructions helped, but only a bit. Microsoft says GPT-5.4 saw the biggest gain, at +0.12 on outcome optimality, while GPT-4.1 and Gemini improved more modestly. That is useful, but it is nowhere near the jump you would need before trusting an agent to negotiate, schedule, or disclose information with real autonomy.

### Why do agents fail here?

Because social tasks are the hard version of automation. The agent has to know what you want, infer what the other side wants, decide what to reveal, and push back when needed. Turns out current systems are much better at following a workflow than at sustained advocacy. Microsoft ties this to a broader principal-agent problem, the same basic issue law and economics have wrestled with for centuries. (github.com)

### What’s the most revealing example?

The 93% number. Microsoft says agents in an earlier simulated multi-agent marketplace accepted the first proposal they received up to 93% of the time without exploring alternatives. That is not strategic behavior. That is closer to a nervous intern trying to end the conversation quickly. (microsoft.com)

### What should people take from this?

Use agents for bounded work. Let them draft, route, summarize, check calendars, maybe prepare options. But keep humans in charge of long-running coordination, negotiation, and anything involving tradeoffs, loyalty, or disclosure. The gap Microsoft is pointing at is not “agents are useless.” It’s narrower and more important: agents can look competent step by step while still failing at the job you actually hired them to do. (microsoft.com)
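
### How can “task completed” and “good outcome” come apart?

A toy calculation makes the gap between the two scores concrete. The sketch below is an illustration only, not Microsoft’s scoring code: the names `NegotiationResult`, `task_completed`, and `outcome_optimality` are hypothetical, and it assumes outcome optimality can be approximated as the share of the available surplus that the buyer’s side captures in a simple one-price deal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NegotiationResult:
    """One simulated purchase where the agent buys for its principal (hypothetical model)."""
    buyer_max: float             # highest price the principal would pay (private)
    seller_min: float            # lowest price the seller would accept (private)
    deal_price: Optional[float]  # agreed price, or None if no deal closed


def task_completed(result: NegotiationResult) -> bool:
    """Completion metric: did any deal close at all?"""
    return result.deal_price is not None


def outcome_optimality(result: NegotiationResult) -> float:
    """Assumed metric: share of the surplus captured by the buyer, clamped to [0, 1]."""
    surplus = result.buyer_max - result.seller_min
    if result.deal_price is None or surplus <= 0:
        return 0.0
    share = (result.buyer_max - result.deal_price) / surplus
    return max(0.0, min(1.0, share))


# Agent accepts the seller's opening ask, which equals the buyer's ceiling.
caved = NegotiationResult(buyer_max=100.0, seller_min=60.0, deal_price=100.0)
# Agent pushes back and settles closer to the seller's floor.
bargained = NegotiationResult(buyer_max=100.0, seller_min=60.0, deal_price=70.0)

for name, result in [("caved", caved), ("bargained", bargained)]:
    print(name, task_completed(result), outcome_optimality(result))
```

Both deals count as completed, but under this assumed metric the first scores 0.0 and the second 0.75. A completion-only eval would rate the two agents the same.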
