Publishers pressure Internet Archive

Major publishers including The New York Times, The Washington Post and USA Today are pressing the Internet Archive over access to archived content that could be used to train AI systems. The SF Examiner reports the dispute frames archival repositories as potential collateral in publishers' battles with AI firms and raises questions about which public archives remain accessible for reuse and research. (sfexaminer.com)

Major publishers are cutting off the Internet Archive’s crawlers, turning a long-running web preservation tool into a new front in the fight over artificial intelligence training data. (sfexaminer.com)

Nieman Lab reported on January 28 that publishers including The Guardian and Financial Times were limiting the Archive’s access because they feared artificial intelligence companies could use archived copies and application programming interfaces as a “back door” to their content. The Conversation reported on February 4 that The New York Times, Financial Times, The Guardian, and USA Today had confirmed they were ending the Archive’s access to their pages. (niemanlab.org) (theconversation.com)

The Internet Archive says the Wayback Machine lets users “search the history of more than 1 trillion web pages,” and its blog said that milestone was reached on October 22, 2025. The nonprofit has been archiving the web since 1996. (archive.org) (blog.archive.org)

The practical issue is simple: if a publisher blocks the Archive now, future versions of its stories may never be captured in the public record. The Electronic Frontier Foundation said on March 16 that archived pages are often the only reliable way to see when news articles were edited, changed, or removed. (eff.org)

Publishers say the risk is not only preservation but reuse. Robert Hahn, The Guardian’s head of business affairs and licensing, told Nieman Lab that the Archive’s application programming interface was “an obvious place” for artificial intelligence firms to pull a large, structured database of articles, even though he said the Wayback Machine itself was less risky. (niemanlab.org)

That pressure comes as publishers pursue separate copyright cases and licensing deals over artificial intelligence. The New York Times sued OpenAI and Microsoft in December 2023, and CourtListener shows the case was still active in the Southern District of New York as of April 6, 2026. (courthousenews.com) (courtlistener.com)

The Archive enters this fight with weaker legal footing than it had a few years ago. The United States Court of Appeals for the Second Circuit ruled against the Internet Archive on September 4, 2024, in its book-scanning case with major publishers, and trade coverage said the case effectively ended in December 2024 when the Archive did not seek Supreme Court review. (law.justia.com) (publishingperspectives.com)

The nonprofit also settled a separate copyright fight with major record labels in September 2025 over its Great 78 Project, which digitized old records. That left the Archive facing publishers’ new demands after losses in both books and music. (consequence.net) (sfexaminer.com)

The dispute is narrowing a service that courts, journalists, and researchers have used for decades as a public memory of the web. The more publishers treat archives as potential training datasets, the less of today’s news may remain available to verify tomorrow. (eff.org) (archive.org)
