Xataka: ‘retro AI’ cutoff set to 1930
- Alec Radford, Nick Levine, and David Duvenaud released Talkie-1930, a 13B open model trained only on English text published before January 1, 1931.
- The cutoff is the whole point: 260 billion pre-1931 tokens, plus an instruction-tuned version built from etiquette manuals, encyclopedias, and poetry.
- It matters because AI copyright fights are intensifying, and this model tests a cleaner public-domain-only path for training.
A language model trained like it still lives in 1930 just showed up online. That sounds like a gimmick, but the real story is copyright, data hygiene, and a surprisingly neat AI experiment. The project is called Talkie-1930, and it comes from Alec Radford, Nick Levine, and David Duvenaud. The model was released in late April 2026 with a hard knowledge cutoff at December 31, 1930, not because 1930 is magical, but because that line maps onto U.S. public-domain rules for older works. (github.com)

### What is this thing, exactly?

Talkie-1930 is a 13-billion-parameter open-weight language model trained only on pre-1931 English-language text. The team also released an instruction-tuned version, so this is not just a research checkpoint sitting in a repo: you can actually run it or try a hosted chat demo. They also published a comparison model with the same architecture and training budget, which makes the whole project feel more like a controlled experiment than a novelty drop. (github.com)

### Why 1930 and not some other year?

Because January 1, 1931 is the practical legal wall. In the U.S., works published before 1931 are in the public domain, so a pre-1931 corpus gives the team a much cleaner rights story than the usual “we scraped a lot of the internet and will argue later” approach. That does not solve every legal question everywhere, but it does sharply reduce one of the biggest fights around generative AI training data. (marktechpost.com)

### Is this mainly about copyright?

Partly, but not only that. The other big idea is benchmark contamination. Modern models often train on piles of web text that may already contain test-set answers, summaries of later events, or endless restatements of the same [...] from older text rather than what they accidentally absorbed from the modern web. (marktechpost.com)

### What did they train it on?

A lot of old text: 260 billion tokens’ worth.
The project description points to books, newspapers, periodicals, scientific journals, patents, and case law in the pre-1931 corpus. The instruction-tuned model then got extra shaping from etiquette manuals, encyclopedias, and poetry [...] ways of speaking and explaining. (marktechpost.com)

### So does it talk like a person from 1930?

Pretty much, or at least like a mashup of the printed English available before 1931. The public demo openly warns that the model reflects the values and biases of those texts, not the beliefs of the creators, and [...] all the blind spots that come with that. (talkie-lm.com)

### Why are people paying attention?

Because the timing is sharp. AI companies are under growing pressure over whether training on copyrighted works counts as fair use, licensing, infringement, or something in between. Xataka’s angle makes sense here. Talkie-1930 looks like a concrete answer to a question hanging over the industry: what if you built a useful model from material with a much cleaner legal [...] labs will suddenly give up modern data, but it gives researchers and smaller builders a serious proof of concept. (xataka.com)

### What’s the catch?

The catch is capability. A model that has never seen World War II, the internet, or modern science cannot be your general-purpose assistant for current reality. The team seems to know that. The point is not to beat the newest all-purpose chatbot. The point is to isolate what training data does to a model’s knowledge, style, and behavior, and to do it with far less copyright baggage. (decrypt.co)

### Bottom line?

Talkie-1930 is a real model release, but it is also an argument. It says the cutoff date itself can be a design choice that is technical, legal, and cultural at the same time. In an AI industry still fighting over what it was allowed to ingest, that is a much bigger idea than the retro voice. (github.com)
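
For readers who want to see what a cutoff as a design choice could look like in practice, here is a minimal Python sketch of a publication-date filter. It is an illustration only, not the Talkie-1930 team's pipeline: the `Document` class, the `before_cutoff` function, and the sample titles and dates are all hypothetical, and the only detail taken from the article is the January 1, 1931 boundary.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Iterator, Optional

# From the article: keep only text published before January 1, 1931.
CUTOFF = date(1931, 1, 1)


@dataclass
class Document:
    """Hypothetical corpus record; not the project's actual schema."""
    title: str
    published: Optional[date]  # None when the publication date is unknown
    text: str


def before_cutoff(docs: Iterable[Document], cutoff: date = CUTOFF) -> Iterator[Document]:
    """Yield documents whose known publication date is strictly before `cutoff`."""
    for doc in docs:
        # Assumption: undated documents are dropped, because without a date
        # there is no way to show they predate the cutoff.
        if doc.published is not None and doc.published < cutoff:
            yield doc


# Made-up sample records, for illustration only.
sample = [
    Document("Etiquette manual", date(1922, 5, 1), "..."),
    Document("Year-end newspaper", date(1930, 12, 31), "..."),
    Document("New Year's newspaper", date(1931, 1, 1), "..."),
    Document("Undated pamphlet", None, "..."),
]

kept = [doc.title for doc in before_cutoff(sample)]
print(kept)  # ['Etiquette manual', 'Year-end newspaper']
```

Because the cutoff is a parameter, the same filter could be rerun with a different date, which is one simple way to treat the boundary as an experimental variable rather than a fixed fact about the corpus.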