Paper finds 25% document corruption
- Researchers at Microsoft Research released DELEGATE-52, a benchmark showing that large language models corrupt documents during long delegated editing workflows across 52 professional domains.
- In tests of 19 models, even frontier systems averaged 25% document corruption after 20 interactions, and tool-using agents did not improve results.
- The paper targets “delegated work” systems such as vibe coding and file-editing assistants. (arxiv.org)
Large language models can silently damage the files they are asked to edit over long, multi-step jobs, according to a new Microsoft Research paper. (arxiv.org)
The paper, “LLMs Corrupt Your Documents When You Delegate,” introduces a benchmark called DELEGATE-52 for testing document editing across 52 professional domains, including coding, crystallography, genealogy, and music notation. (arxiv.org) (github.com)
The basic problem is delegated work: a user asks a model to keep making changes to a file over many turns, then trusts the model not to break anything hidden inside the document. The benchmark simulates that setup with 310 work environments and files of about 15,000 tokens. (arxiv.org)
Microsoft’s researchers tested 19 models and found that even frontier systems averaged 25% document corruption by the end of long workflows. The paper names Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 among the strongest models, but says they still corrupted documents over time. (arxiv.org)
The failures were often sparse but severe. In the paper’s examples, graph diagrams, textile patterns, and 3D object files kept looking mostly intact while key structure, ordering, or formatting was lost. (arxiv.org)
The benchmark measures whether a file still preserves the information and structure needed to be usable after repeated edits. That is different from asking whether a model answered a question correctly in one shot. (arxiv.org) (github.com)
The paper also reports that agentic tool use did not improve performance on DELEGATE-52. Corruption got worse as documents grew larger, interactions got longer, or distractor files were added to the working context. (arxiv.org)
The public GitHub release includes a redistributable subset with 234 work environments across 48 domains, because some seed documents in the full benchmark cannot be shared. (github.com)
The result lands as software companies push “vibe coding” and other agent-style assistants that edit code, spreadsheets, and documents with limited human review. The paper argues that those systems still need checks that verify the final file, not just the model’s text explanation of what it changed. (arxiv.org)
The closing point is simple: a model can complete each step plausibly and still leave the document broken by the end. DELEGATE-52 is meant to measure that failure before delegated editing becomes routine. (arxiv.org)
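For readers who want a concrete picture of what "verifying the final file" could mean, here is a minimal Python sketch. It is not DELEGATE-52's scoring method and does not come from the paper. It assumes the document is JSON and that the user knows which parts the edit was allowed to touch. The names `flatten`, `unexpected_changes`, and `allowed_prefixes` are hypothetical, chosen for illustration.

```python
import json

_MISSING = object()  # sentinel for "this path does not exist in the document"


def flatten(value, prefix=""):
    """Map every leaf of a nested JSON value to a slash-separated path."""
    if isinstance(value, dict) and value:
        leaves = {}
        for key, child in value.items():
            leaves.update(flatten(child, f"{prefix}/{key}"))
        return leaves
    if isinstance(value, list) and value:
        leaves = {}
        for index, child in enumerate(value):
            leaves.update(flatten(child, f"{prefix}/{index}"))
        return leaves
    # Scalars and empty containers are treated as leaves.
    return {prefix: value}


def unexpected_changes(before_text, after_text, allowed_prefixes):
    """Return paths that changed even though the edit was not meant to touch them.

    Assumption: the caller lists the path prefixes the edit was allowed to modify.
    """
    before = flatten(json.loads(before_text))
    after = flatten(json.loads(after_text))
    changed = {
        path
        for path in before.keys() | after.keys()
        if before.get(path, _MISSING) != after.get(path, _MISSING)
    }
    return sorted(
        path
        for path in changed
        if not any(path.startswith(prefix) for prefix in allowed_prefixes)
    )


if __name__ == "__main__":
    original = '{"title": "Plan", "steps": ["a", "b", "c"], "owner": "kim"}'
    # The model was asked to rename the title only, but it also dropped a step.
    edited = '{"title": "Plan v2", "steps": ["a", "c"], "owner": "kim"}'
    print(unexpected_changes(original, edited, ["/title"]))
```

The check compares the edited file directly against the original, so a dropped list item is flagged even if the model's summary mentions only the title change. Real formats such as 3D object files or music notation would need format-specific parsers in place of `json.loads`.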