UCSD finds state media leaves chatbot traces

- UC San Diego researchers and collaborators published a Nature study on May 13 showing state media control can leave detectable traces in chatbot behavior.
- The study examined 37 countries and found more than 3.1 million Chinese-language documents overlapping with state-linked phrasing in open training data.
- The paper and UC San Diego release were published May 13, with co-authors spanning Oregon, Purdue, UC San Diego, NYU and Princeton.

UC San Diego researchers and collaborators published a Nature study on May 13 that links governments’ control of online media environments to measurable differences in how AI chatbots answer political questions. The paper says those effects show up in the training data that large language models learn from and can later be detected in model outputs. The authors tested models across 37 countries and paired that cross-national analysis with a China case study. The work was released alongside a UC San Diego summary and a broader university press package on May 13.

### How did the researchers say governments affect chatbots without touching the models directly?

The Nature study said governments can influence chatbot behavior indirectly by shaping the web content that models ingest during training. The authors described that pathway as “institutional influence,” meaning political institutions alter information environments before model developers collect the data. (today.ucsd.edu)

Hannah Waight, a co-first author and assistant professor of sociology at the University of Oregon, said the internet is not a neutral source for AI systems. Joshua Tucker, a co-author and co-director of the NYU Center for Social Media, AI, and Politics, said the study shifts attention “upstream” from what AI systems generate to the political conditions that shape training material. (today.ucsd.edu)

### What did the 37-country test actually find?

The 37-country analysis focused on countries where a national language is largely concentrated within a single country, according to the university release and EurekAlert summary. The researchers found that models portrayed governments and institutions from countries with stronger media control more favorably when prompts were written in that country’s language than when the same topics were asked in English. (today.ucsd.edu)

The six linked studies combined open training-data analysis, experiments with small models, human evaluation and tests of commercial chatbots, the authors said. That design was meant to trace the path from online media to training corpora and then to model behavior.

### Why does China figure so prominently in the paper?

China was the paper’s main case study for showing how state-linked material can enter training datasets at scale. (eurekalert.org)

Comparing two sources of Chinese state-coordinated media with a major open-source multilingual dataset derived from Common Crawl, the researchers found more than 3.1 million Chinese-language documents with substantial phrasing overlap. That represented about 1.64% of the dataset’s Chinese-language subset, according to the UC San Diego release and Nature search summary. (today.ucsd.edu)

TechXplore’s repost of the release said that share was about 40 times the representation of Chinese-language Wikipedia documents in the same dataset. The paper used that example to argue that state-linked material can become a meaningful part of what multilingual models learn from.

### Did the paper say this shows up in commercial chatbots too?

The authors said they tested “real-world” commercial chatbots as part of the six-study package. (today.ucsd.edu)

The university release said the team found similar patterns in model responses about politics, especially when the prompts were written in a country’s own language. Nature’s summary described the central result more broadly: government-controlled media influences large language models through training data, and models queried in languages from countries with lower media freedom showed a more positive slant toward those governments. (techxplore.com)

That finding was presented as an observed pattern in outputs, not as evidence of direct government access to proprietary models. (today.ucsd.edu)

### Who conducted the study and where was it published?

The May 13 paper was published in Nature and involved researchers from the University of Oregon, Purdue University, the University of California San Diego, New York University and Princeton University. Named authors in the release materials included Hannah Waight, Eddie Yang, Yin Yuan, Solomon Messing, Margaret E. Roberts, Brandon M. Stewart and Joshua A. Tucker. (nature.com)

UC San Diego published its account of the findings on May 13, and EurekAlert distributed a parallel release the same day under a Nature embargo that lifted at 11:00 a.m. U.S. Eastern time. The peer-reviewed article is listed on Nature’s site under the title “State media control influences large language models.”

### What comes next for readers who want the underlying paper?

Nature listed the paper online on May 13, and UC San Diego’s release points readers to the full study and related examples released that day. (eurekalert.org)

The next concrete step for outside researchers is likely to be scrutiny of the paper’s methods, datasets and country-level comparisons in the published Nature article and accompanying institutional materials. (nature.com)
