Muses‑Bench tests multi‑user agents

A new paper formalized multi‑principal LLM agent problems and introduced Muses‑Bench to evaluate conflicts, privacy, and coordination across users. The authors report that even top models struggle: Gemini‑3‑Pro averaged about 85.6% on the benchmark, which highlights gaps in team‑scale agent reliability (x.com).

A new paper argues that large language model agents built for one user break down when they have to serve a group with clashing goals. (arxiv.org)

The paper, “Multi-User Large Language Model Agents,” was posted to arXiv on April 11, 2026, by researchers from Stanford University, King Abdullah University of Science and Technology, the University of Toronto, and the Massachusetts Institute of Technology. It frames the problem as a “multi-principal decision problem,” where one agent has to balance multiple users with different roles, preferences, and authority. (arxiv.org)

In plain terms, the setup is closer to a shared workplace assistant than a solo chatbot. One model may need to answer a manager, a teammate, and a client in the same task while keeping some information private and still finishing the job. (arxiv.org)

The authors built a benchmark called Muses-Bench to test those situations through three stress cases: instruction following under conflicting requests, privacy preservation when users hold private context, and coordination when the agent has to gather information across people over multiple turns. The project code says the benchmark lives inside the `muses_bench` package and includes evaluators for access control and instruction following. (arxiv.org) (github.com)

The paper says current frontier models show “systematic gaps” on all three fronts. In the authors’ summary, models often failed to keep a stable priority order when user goals conflicted, leaked more private information as conversations got longer, and slowed down when coordination required repeated back-and-forth. (arxiv.org)

That is a different failure mode from the usual single-user benchmark miss. A model can look strong on math, coding, or question answering and still mishandle a team workflow where access rules, hidden context, and rank order matter as much as raw reasoning. (arxiv.org) (ai.google.dev)

The timing lines up with a broader push by model companies to turn chat systems into agents that plan, use tools, and act across longer tasks. Google’s current Gemini agent documentation describes agents as systems that combine models, tools, and reasoning to execute multi-step work, which is exactly the kind of setting this paper says remains under-tested for multiple users. (ai.google.dev) (arxiv.org)

The authors present the work as the first systematic study of multi-user large language model agents, not just multi-agent systems where several models talk to each other. Their focus is one agent serving several humans at once, with unequal permissions and incomplete information. (arxiv.org) (github.com)

The repository linked from the paper includes benchmark code, data folders, and a paper directory labeled `ICML2026_multi_user_LLM.pdf`, pointing to a planned conference submission. For now, the central result is narrower and more practical: shared assistants still struggle when “helpful” for one person can mean disloyal, leaky, or inefficient for another. (github.com) (arxiv.org)
