Block AI access to news archives

- Around 245 news organizations across nine countries are now blocking Internet Archive crawlers, trying to stop AI companies from mining archived journalism for model training. (euronews.com)
- The pressure point is scale: at least 241 news sites block one Archive bot, and 87% of those sites are owned by USA Today Co. (niemanlab.org)
- The fight matters because publishers are targeting an AI backdoor, but risk punching holes in the web’s public historical record. (niemanlab.org)

The fight is about archives, but the real target is AI training. News publishers are starting to block the Internet Archive’s crawlers because they think archived articles are becoming an [...] (euronews.com) [...] it is one of the main public records of what the web used to say. And over the past few months, that record has started to get thinner. (euronews.com)

### Why are publishers going after the Internet Archive?

Because the Archive has become a kind of side door. A publisher can [...] (niemanlab.org) [...] company may still be able to pull them from there. Publishers see that as losing control twice: once on the open web, and again inside a nonprofit library. (niemanlab.org)

### Why is archived news so useful to AI?

Archived journalism is unusually clean training data. It is dated, attributed, edited, and usually organized in machine-friendly ways [...] (euronews.com) [...]ting with metadata attached, exactly the kind of thing model builders want. (euronews.com)

### What changed this week?

The scale of the blocking is getting clearer. Roughly 245 news organizations across nine countries are now trying to block Internet Archive crawlers, and more than 20 major out[...] (niemanlab.org) [...] It is turning into an industry move. (euronews.com)

### Who is doing most of the blocking?

A huge chunk comes from USA Today Co. Most of the blocked sites in these counts are owned by that company, which means the effect is not just on a few national brands but on hu[...] (euronews.com) [...]y. (niemanlab.org)

### Are all publishers blocking in the same way?

No, and that matters. The Financial Times blocks bots scraping paywalled content, including the Internet Archive. The Guardian has taken a narrower route, keeping some landing pages visible [...] (euronews.com) [...] robots.txt conventions. So this is not one standard policy. It is a patchwork of defenses. (niemanlab.org)

### Is this really about copyright lawsuits?

A lot of it is. Publishers are already suing AI companies over whether training on [...] (niemanlab.org) [...]al stays available through preserved snapshots. The catch is that the Archive itself is not the company building the chatbot. It is getting hit because it stores the material. (euronews.com)

### What gets lost if this keeps spreading?

The web’s memory. Archived pages are often the only way to see what an article looked like before edits, remo[...] (niemanlab.org) [...], future readers may not just lose access to old stories; they may lose the ability to verify how public information changed over time. (eff.org)

### So what is the bottom line?

Publishers are trying to stop AI companies from treating archives like a free database. Fair enough. But the tool they are cutting off [...] (euronews.com) [...] risk is historical amnesia. (niemanlab.org)
