Big Tech AI Trains on Private Chats By Default
A Stanford analysis reveals that major AI firms like OpenAI and Google use private user conversations for model training by default, with no clear timelines for data deletion. The report highlights a "two-tiered privacy system" where enterprise users are automatically protected, while regular consumers must navigate confusing opt-outs. This practice is becoming a critical privacy flashpoint for consumer products.
The Stanford study analyzed the privacy policies of six major U.S. AI developers: Amazon, Anthropic, Google, Meta, Microsoft, and OpenAI. It found that all six use consumer chat data by default to train their models. The researchers note that these companies represent nearly 90% of the American chatbot market.
This default data collection for consumers stands in stark contrast to the protections offered to enterprise clients. Business and enterprise accounts with services like OpenAI's ChatGPT and Google's Gemini are typically opted out of model training by default, with data handling governed by contractual agreements that prohibit its use for improving the models.
Data retention timelines for consumers who don't opt out are often vague or lengthy. Anthropic deletes data within 30 days for users who opt out of training, but keeps it for five years for those who don't. OpenAI and Meta provide no specific time limits for data deletion, raising concerns about the long-term risks of data exposure.
Opting out of this data collection is possible, but it requires users to dig through settings. For ChatGPT, users must go into "Data Controls" and disable the option to "Improve the model for everyone." Similarly, Google Gemini users can turn off "Gemini Apps Activity" to prevent future data from being reviewed by humans, though previously collected data may be retained for up to three years.
The study also raised concerns about the collection of data from minors. Four of the six companies, including Meta and OpenAI, permit users as young as 13 and appear to include their chat data in model training, which could conflict with child privacy protection laws.
The issue extends beyond chat inputs. "Web scraping," the automated extraction of massive amounts of data from the public internet, has been a primary method for gathering AI training data. It has led to numerous lawsuits alleging that personal and copyrighted information was used without consent.
The lack of clear federal regulation in the U.S. has left a complex patchwork of state-level laws, such as the California Consumer Privacy Act (CCPA). This contrasts with Europe's General Data Protection Regulation (GDPR), which imposes stricter requirements, including a lawful basis for processing personal data for AI training.
Public trust in AI companies to handle data responsibly is declining. Growing consumer awareness, coupled with increased regulatory scrutiny worldwide, is putting pressure on AI developers to adopt more transparent and privacy-preserving practices.