Copilot teams up LLMs

Microsoft updated Copilot's Researcher agent to run multiple LLMs in the same workflow, so GPT can draft and Claude can critique, creating an explicit multi‑model audit trail for each step. The change surfaces which model handled what and embeds the critique in the output for traceability and faster debugging, a notable shift toward multi‑model orchestration in enterprise apps. (geekwire.com)

Microsoft rolled out the Researcher updates, including the new Critique and Council capabilities, on March 30, 2026 as part of Microsoft 365 Copilot’s Researcher agent. (techcommunity.microsoft.com)

Critique explicitly separates generation from evaluation. A “generator” role plans retrieval and produces a draft, and a distinct “reviewer” role applies rubric‑based evaluation to strengthen accuracy, structure and citations. (techcommunity.microsoft.com) Microsoft reported that Researcher with Critique scored 7.0 points higher on the DRACO benchmark (a 13.88% gain over the best system in the cited paper) across 100 complex tasks spanning 10 domains, with the evaluation linked to Zhong et al.’s Feb. 2026 arXiv study. (techcommunity.microsoft.com)

Council surfaces multiple model responses side‑by‑side and generates a cover letter that highlights where models agree, where they diverge, and the unique signals each model contributes to a result. (techcommunity.microsoft.com)

Anthropic Claude models (Claude Sonnet 4 and Claude Opus 4.1) were added earlier as selectable options in Copilot. They require tenant admin opt‑in, which can be enabled from the Microsoft 365 admin center, and they are hosted outside Microsoft‑managed environments under Anthropic’s terms. (microsoft.com)

Microsoft’s multi‑model traces (model role metadata, embedded reviewer commentary and the Council cover letter) create a more explicit lineage for each research step. Enterprise audit posture still depends on tenant logging and retention settings, however: standard audit retention is 90 days unless extended via paid Purview options. (geekwire.com)

Work from Microsoft Research (VeriTrail) underscores the need for provenance and error localization across multi‑step generative pipelines. It offers methods to detect where hallucinations are introduced in chained model workflows, which can inform how platform teams instrument step‑level trace logs. (microsoft.com)
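For platform teams thinking about step‑level trace logs, the generator/reviewer split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TraceStep` and `Trace` classes, the `run_with_critique` function, and the `model-a`/`model-b` labels are all hypothetical names invented here, the model calls are stand‑in stubs, and nothing below reflects Microsoft's actual schema or API.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Callable, List, Tuple


@dataclass
class TraceStep:
    """One step in a multi-model workflow (hypothetical schema)."""
    step: int
    role: str    # "generator" or "reviewer"
    model: str   # label of the model that handled this step
    output: str  # text the model produced at this step


@dataclass
class Trace:
    """Ordered record of which model did what."""
    steps: List[TraceStep] = field(default_factory=list)

    def record(self, role: str, model: str, output: str) -> str:
        # Append a numbered step and hand the output back to the caller.
        self.steps.append(TraceStep(len(self.steps) + 1, role, model, output))
        return output

    def to_json(self) -> str:
        return json.dumps([asdict(s) for s in self.steps], indent=2)


def run_with_critique(
    question: str,
    generate: Callable[[str], str],
    review: Callable[[str], str],
    trace: Trace,
    generator_model: str = "model-a",  # assumed label, not a real model ID
    reviewer_model: str = "model-b",   # assumed label, not a real model ID
) -> Tuple[str, str]:
    """Draft with one model, critique with another, and log both steps."""
    draft = trace.record("generator", generator_model, generate(question))
    critique = trace.record("reviewer", reviewer_model, review(draft))
    return draft, critique


# Stub functions stand in for real model calls.
trace = Trace()
draft, critique = run_with_critique(
    "What changed in the release?",
    generate=lambda q: f"Draft answer to: {q}",
    review=lambda d: f"Check citations in: {d}",
    trace=trace,
)
print(trace.to_json())
```

Keeping role and model on every step is what lets a reader of the log work out where an error entered the chain. How long such a log survives is a separate matter governed by the tenant's retention settings.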
