Training data becomes a cost
Legal fights over whether models can be trained on copyrighted material are turning training data into a commercial expense rather than a free input. A Princeton Legal Journal overview shows that lawsuits from newspapers, authors, and image libraries are forcing model developers to choose between licensing and the risk of litigation, and industry leaders have begun forming an AI licensing group to negotiate permissions. That shifts enterprise procurement: buyers must now ask vendors about data provenance and indemnities, not just raw model performance. (legaljournal.princeton.edu)
For years, artificial intelligence companies treated training data like air: scrape first, ask later. In 2026, that assumption is colliding with lawsuits from The New York Times, authors, and image companies, and the bill for “free” data is starting to show up in contracts. (legaljournal.princeton.edu)
The core fight is simple: a model learns patterns by ingesting huge volumes of text, images, and audio, and many of those files belong to someone. A Princeton Legal Journal overview says the old defenses were fair use and Section 230 of the Communications Decency Act, but both are now being tested in court instead of accepted in the market. (legaljournal.princeton.edu)
The newspaper case is the cleanest example. The New York Times sued OpenAI and Microsoft in December 2023, alleging that millions of Times articles were used without permission to build products that can now compete with the paper for readers and subscriptions. (courthousenews.com)
Authors made the same argument from the book side. In Authors Guild v. OpenAI, writers said their books were copied into training sets without consent, turning copyrighted novels into raw material for a machine that can imitate style and summarize plots on demand. (law.justia.com)
Image libraries attacked the same practice from another angle. Getty Images sued Stability AI, and a United Kingdom High Court judgment published on November 4, 2025 became one of the first major rulings anywhere to examine how copyright law applies to training generative artificial intelligence systems. (judiciary.uk)
While the courts grind forward, the market is already changing. OpenAI signed licensing deals with The Associated Press in July 2023, Axel Springer in December 2023, the Financial Times in April 2024, and News Corp in May 2024, each one turning publisher archives from a free scrape target into a paid input. (ap.org) (openai.com 1) (openai.com 2) (openai.com 3)
The same thing happened in images. Shutterstock said in July 2023 that it had expanded its partnership with OpenAI under a new six-year agreement to provide image, video, music, and metadata for model training. (prnewswire.com)
A licensing system is now being built around that reality. In April 2025, Publishers’ Licensing Services and the Authors’ Licensing and Collecting Society said they were developing a collective generative artificial intelligence licence through the Copyright Licensing Agency, with launch planned for the third quarter of 2025. (thebookseller.com) (ppa.co.uk)
That changes who has leverage. If a publisher or photo library can either sue you or license to you, then training data starts to look less like free fuel and more like cloud computing or electricity: a recurring cost that can be negotiated, bundled, and priced by quality. (legaljournal.princeton.edu) (copyright.gov)
The United States Copyright Office said in its May 2025 report that pre-training uses orders of magnitude more data and computing power than later stages, and it also said licensing copyrighted works for training is already feasible because licensing is already central to many content industries. (copyright.gov) (authorsguild.org)
Now the procurement problem lands on buyers. Stanford researchers reviewing legal technology contracts in March 2025 found that only 33 percent of artificial intelligence vendors provided indemnification for third-party intellectual property claims, which means many customers could be left exposed if the vendor’s training data turns out to be contested. (law.stanford.edu)
So the new enterprise checklist is less about benchmark scores alone. A serious buyer now has to ask where the training corpus came from, what licenses cover it, whether the vendor will defend copyright claims, and whether the contract blocks the vendor from training on the buyer’s own proprietary files. (law.stanford.edu) (cloud.google.com)