Altman Defends Use of News Content for AI Training

OpenAI CEO Sam Altman has defended the practice of using news content to train AI models. The comments come as the company faces increasing legal and regulatory scrutiny over its data sourcing practices. The debate over fair use and copyright for training data remains a central issue for the AI industry.

- In parallel with Altman's defense of "fair use," OpenAI has actively pursued licensing agreements with numerous publishers, including the Associated Press, Financial Times, Axel Springer, and News Corp, which owns The Wall Street Journal. These deals grant OpenAI access to archives and current content, sometimes exclusively for certain languages, to train its models and feature in products like ChatGPT.
- The New York Times filed a copyright infringement lawsuit against OpenAI and Microsoft, alleging the unlawful use of millions of its articles for training AI models. The suit claims that this practice threatens their business by creating a substitute product and seeks billions in damages, as well as the destruction of models trained on their data.
- A former OpenAI researcher, Suchir Balaji, stated that during his time on the data collection team, he came to believe the company's methods of scraping internet data, including from behind paywalls and pirated archives, violated copyright law. A recent study also suggests that GPT-4o shows strong recognition of paywalled, copyrighted book content, indicating it was likely used in training.
- Reinforcement Learning from Human Feedback (RLHF) is a key technique for aligning models, where human annotators rank different AI-generated responses to train a "reward model" that then guides the AI's behavior. This process requires significant high-quality human preference data to teach the model desired qualities like helpfulness and harmlessness.
- An alternative to RLHF is Constitutional AI, an approach developed by Anthropic that uses a predefined set of principles or a "constitution" to guide the model. The AI critiques and revises its own responses based on these rules, reducing the need for extensive human labeling and making the alignment process more scalable and transparent.
- While synthetic data can be generated quickly and at a lower cost to expand training sets, it often lacks the nuance and accuracy of human-labeled data, especially for complex or subjective tasks. A hybrid approach, using synthetic data for scale and human annotation for critical edge cases and quality assurance, is often considered the most effective strategy.
- Evaluating agentic AI systems, which can perform multi-step tasks and use external tools, requires a shift from traditional metrics like accuracy. New benchmarks focus on assessing the entire process, including the coherence of reasoning, efficiency of tool use, and overall task success rate.
- Venture capital investment in AI startups surged to $211 billion in 2025, an 85% increase from 2024, with about half of all VC dollars now going into the AI sector. AI infrastructure companies, including those focused on data, received 19% of this funding, while foundational model developers captured 40% of the total.
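
For readers who want a concrete picture of the "reward model" mentioned in the RLHF item, here is a minimal sketch in Python. It is an illustration under stated assumptions, not OpenAI's or any lab's actual pipeline: it assumes each response has already been reduced to two made-up numeric features (a helpfulness score and a harmfulness score), uses a simple linear reward, and fits it with the pairwise Bradley-Terry style loss commonly used for preference data. All names (`PAIRS`, `reward`, `train`) and the toy numbers are hypothetical.

```python
import math

# Hypothetical toy data: each pair is (features of the response the annotator
# preferred, features of the response the annotator rejected).
# Features are [helpfulness_score, harmfulness_score]; both are made up.
PAIRS = [
    ([0.9, 0.1], [0.2, 0.8]),
    ([0.7, 0.0], [0.6, 0.9]),
    ([0.8, 0.2], [0.3, 0.3]),
]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def reward(weights, features) -> float:
    """Linear reward model: a weighted sum of the response features."""
    return sum(w * x for w, x in zip(weights, features))


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected))."""
    diff = r_chosen - r_rejected
    # Stable form of log(1 + exp(-diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def train(pairs, dim: int, lr: float = 0.1, epochs: int = 200):
    """Fit the reward weights by gradient descent on the pairwise loss."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            diff = reward(w, chosen) - reward(w, rejected)
            # Gradient of the loss w.r.t. w is
            # -(1 - sigmoid(diff)) * (chosen - rejected), so step the other way.
            scale = 1.0 - sigmoid(diff)
            w = [wi + lr * scale * (c - r) for wi, c, r in zip(w, chosen, rejected)]
    return w


if __name__ == "__main__":
    weights = train(PAIRS, dim=2)
    for chosen, rejected in PAIRS:
        print(reward(weights, chosen) > reward(weights, rejected))
```

In a real system the linear function would be replaced by a neural network scoring full text, and the learned reward would then steer a separate policy-optimization step; this sketch covers only the preference-fitting stage.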
