Stanford Study: AI Firms Train on Chats by Default
A Stanford HAI analysis of the privacy policies of leading AI companies, including OpenAI and Anthropic, found that all of them train their models on user conversations by default. The study also found that some merge chat data with other personal information, such as search and purchase history, and that some retain it indefinitely, prompting calls for federal opt-in regulations.
The Stanford study was authored by Jennifer King, a Privacy and Data Policy Fellow at the university's Institute for Human-Centered AI (HAI). The research highlights that while some AI developers offer an opt-out, it is often buried in settings, so users may share their data without realizing it. Training on chats is becoming an industry standard as companies seek more data to improve their models, especially as publicly available English-language data from the internet becomes scarcer.
The scope of data collection often extends beyond chat inputs to include uploaded files, with some companies retaining this information indefinitely. Human reviewers may also read transcripts of user conversations to help train the models. The result is a tiered system in which paying enterprise customers often receive greater privacy protections by default than individual consumers using free versions of the same services.
In the U.S., there is no single comprehensive federal law governing AI, which leaves a complex patchwork of state regulations and federal agency guidelines. This has led major tech companies to call for federal rules that would preempt varying state laws. In contrast, the European Union's AI Act, which is being implemented in phases, takes a risk-based approach and could set a global standard, much as GDPR did for data privacy.
Turkey is actively developing its AI regulatory framework, aiming to align with the EU's risk-based model. The Turkish Personal Data Protection Authority (KVKK) has issued guidelines clarifying that AI systems are fully subject to existing data protection laws, including at the training stage. In November 2025, the KVKK released specific guidance on the use of personal data in generative AI, emphasizing that publicly available data cannot be used for training without a proper legal basis. A draft AI law was also submitted to the Turkish Parliament in June 2024.
To address these privacy concerns, researchers are developing privacy-enhancing technologies (PETs). Federated learning allows models to be trained on decentralized data: each device computes model updates locally and shares only those updates, so the raw data never leaves the user's device. Differential privacy adds calibrated statistical noise to data or to computed results so that no single individual's contribution can be reliably identified. Homomorphic encryption enables computation directly on encrypted data, keeping it confidential while it is processed. These technologies are becoming crucial for sectors like healthcare, where they can allow AI to be trained on sensitive patient data with far less privacy risk.
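As a rough illustration of the differential privacy idea described above, the following Python sketch releases a noisy count using the Laplace mechanism, one common way to add calibrated noise. It is not drawn from the Stanford study or from any company's system. The function names (`laplace_noise`, `private_count`), the sample records, and the choice of epsilon are all hypothetical, and a production system would rely on a vetted library rather than hand-rolled sampling.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with rate
    1/scale follows a Laplace distribution with that scale.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of records matching predicate.

    Hypothetical helper for illustration only. Adding or removing one
    person changes a count by at most 1, so the sensitivity is 1 and
    the noise scale is sensitivity / epsilon. Smaller epsilon means
    more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    # Made-up records standing in for sensitive data.
    patients = [
        {"age": 34, "diagnosis": "flu"},
        {"age": 61, "diagnosis": "diabetes"},
        {"age": 47, "diagnosis": "diabetes"},
        {"age": 29, "diagnosis": "asthma"},
    ]
    noisy = private_count(
        patients,
        lambda p: p["diagnosis"] == "diabetes",
        epsilon=0.5,
    )
    print(f"Noisy diabetes count: {noisy:.2f}")
```

Each run prints a different value near the true count of 2. The analyst learns an approximate answer, while the presence or absence of any single record is masked by the noise. The epsilon value of 0.5 is an arbitrary choice for the demo, and picking it in practice is a policy decision as much as a technical one.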