arXiv paper finds LLMs corrupt about 25% of document content
- Microsoft Research authors Philippe Laban, Tobias Schnabel and Jennifer Neville posted an arXiv paper on April 17 saying today's LLMs corrupt documents in delegated workflows.
- Their DELEGATE-52 benchmark spans 310 work environments across 52 professions, and even frontier models corrupted about 25% of document content after long runs.
- The paper says tool use did not fix the problem, and larger files worsened it. (arxiv.org)
Large language models can look competent while quietly damaging the files they edit over time. A new arXiv paper from Microsoft Research says even top models corrupted about 25% of document content in long delegated workflows. (arxiv.org)

The paper, “LLMs Corrupt Your Documents When You Delegate,” was submitted April 17, 2026 by Philippe Laban, Tobias Schnabel and Jennifer Neville. It studies what happens when a person hands off multi-step editing work and checks the result later, instead of reviewing every change. (arxiv.org)

To test that, the authors built DELEGATE-52, a benchmark with 310 work environments across 52 professional domains, including coding, crystallography, genealogy and music notation. Each environment uses real documents totaling about 15,000 tokens and asks the model to complete 5 to 10 editing tasks. (arxiv.org)

This is not a chatbot trivia test. It is closer to handing an assistant a stack of working files, asking for several revisions, and finding out later whether the files still make sense. (arxiv.org)

The headline result is that current models “degrade documents during delegation,” with frontier systems including Gemini 3.1 Pro, Claude 4.6 Opus and GPT 5.4 averaging 25% corruption by the end of long workflows. Less capable models performed worse. (arxiv.org)

The authors say the errors were often sparse but severe, which means a document could look mostly intact while key parts were broken. In their framing, that makes delegated work risky because the damage can compound across many steps before a human notices. (arxiv.org 1) (arxiv.org 2)

The paper also reports that agentic tool use did not improve performance on DELEGATE-52. Corruption got worse as documents got larger, workflows got longer, or extra distractor files were added. (arxiv.org)

That finding lands as companies push “agents” and “vibe coding” into office work and software workflows. The paper argues that task completion alone is not enough if the underlying files, diagrams or structured documents are being silently altered. (arxiv.org)

The benchmark is text-only, but the examples cover files that represent visual or structured artifacts such as graph diagrams, textile patterns and 3D objects. The authors’ point is that a model can make a small textual change that ruins the thing the file describes. (arxiv.org)

The paper is an arXiv preprint, not a peer-reviewed journal article, and its benchmark reflects the authors’ design choices. But its central claim is concrete: in long, real-document workflows, today’s LLMs can finish the job and still leave the documents corrupted. (arxiv.org)
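
For readers who want a feel for how damage can compound across many steps, here is a rough back-of-the-envelope sketch in Python. It is not the paper's methodology. It assumes, purely for illustration, that every editing step damages the same share of whatever content is still intact, and the step counts are hypothetical. Only the 25% figure comes from the paper as reported above; the function names are made up for this example.

```python
def surviving_fraction(per_step_loss: float, steps: int) -> float:
    """Share of content still intact after `steps` edits.

    Illustrative assumption (not from the paper): each edit damages the
    same fraction of whatever content is still intact.
    """
    if not 0.0 <= per_step_loss <= 1.0:
        raise ValueError("per_step_loss must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return (1.0 - per_step_loss) ** steps


def per_step_loss_for(total_loss: float, steps: int) -> float:
    """Per-step loss that compounds to `total_loss` over `steps` edits,
    under the same constant-loss assumption."""
    if not 0.0 <= total_loss <= 1.0:
        raise ValueError("total_loss must be between 0 and 1")
    if steps <= 0:
        raise ValueError("steps must be positive")
    return 1.0 - (1.0 - total_loss) ** (1.0 / steps)


if __name__ == "__main__":
    # 0.25 is the reported end-of-workflow corruption; the step counts
    # below are hypothetical workflow lengths chosen for illustration.
    for steps in (5, 10, 20):
        loss = per_step_loss_for(0.25, steps)
        print(f"{steps:>2} steps: {loss:.1%} lost per step compounds to 25% overall")
```

Under this simplified model, the longer the workflow, the smaller the per-step loss needed to reach the same total, which is one way to see why small, hard-to-spot errors can matter in long delegated runs.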