Deep learning’s origin story
James Rosen-Birch posted a concise thread tracing the jump from shallow neural nets to modern deep learning, pointing to the 2012 ImageNet breakthrough and the later rise of transformer architectures as the turning points. (x.com) His post is a fast primer if you want the historical anchors that explain why today’s models scale so differently from earlier approaches. (x.com)
The neural network spent decades as a clever idea that usually lost to simpler methods. Then, in 2012, one image contest turned it from an academic side road into the main highway of artificial intelligence. (proceedings.neurips.cc)

The basic trick is easy to picture: instead of writing rules for every cat ear, wheel, or face, you build a stack of adjustable filters and let data tune the filters for you. Each layer picks up a slightly bigger pattern than the one before it, like going from edges to corners to eyes to whole objects. (awards.acm.org)

Early versions of that idea were called shallow neural networks because they used only a few layers. They could learn small patterns, but they struggled to build the long chain of intermediate steps needed for hard tasks like recognizing 1,000 object categories in messy real photos. (awards.acm.org)

Researchers kept the idea alive through the 1980s, 1990s, and 2000s, even when most of the field favored hand-built features and other machine-learning methods. The Association for Computing Machinery later named Geoffrey Hinton, Yann LeCun, and Yoshua Bengio the recipients of the 2018 A.M. Turing Award for the conceptual and engineering breakthroughs that made deep neural networks practical. (awards.acm.org)

What changed was not one magic equation but several kinds of scale arriving together. Bigger labeled datasets, faster graphics processing units, and training tricks that reduced overfitting finally gave deeper networks enough room and enough compute to show what they could do. (proceedings.neurips.cc)

The turning point most histories use is ImageNet, a giant image dataset tied to an annual competition called the ImageNet Large Scale Visual Recognition Challenge. In the 2012 contest, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered a deep convolutional neural network later nicknamed AlexNet. (proceedings.neurips.cc) AlexNet trained on about 1.2 million images across 1,000 categories and posted a top-five error rate of 15.3 percent, meaning the correct label was missing from its five best guesses on 15.3 percent of test images.
The second-best entry scored 26.2 percent, a gap so large that it looked less like a routine win than a regime change. (proceedings.neurips.cc)

That result mattered because earlier computer-vision systems usually depended on humans deciding which visual features to extract first. AlexNet learned its own internal features from raw pixels, which meant the same basic recipe could improve as data and computing power improved. (proceedings.neurips.cc)

The next big shift came in language rather than images. In 2017, researchers at Google introduced the transformer, a model architecture built around attention, which is a way for the model to decide which other words in a sequence matter most for the word it is processing now. (research.google)

Before transformers, many language systems read text step by step, like a person moving one bead at a time on an abacus. The transformer dropped that sequential bottleneck, letting training run in parallel across many positions while still modeling long-range relationships between words. (research.google)

The original transformer paper reported 28.4 BLEU, a standard translation-quality score, on the WMT 2014 English-to-German translation task and said the model required significantly less time to train than the best previous systems. More important than the benchmark itself, the architecture scaled cleanly as researchers increased data, parameters, and compute. (research.google)

That is the arc James Rosen-Birch compresses into a short social-media thread: shallow networks proved the concept, AlexNet proved depth could dominate at scale, and transformers proved that a new architecture could keep improving as the scale got much larger. His post works as a fast map of the two moments most people now use to explain why modern artificial intelligence feels discontinuous with the systems that came before it.
(x.com, proceedings.neurips.cc, research.google)

If you want the shortest version, it is this: deep learning stopped being a niche bet when depth, data, and graphics chips lined up in 2012, and it became today’s scaling machine when attention replaced older sequence machinery in 2017. Everything from image recognition to large language models sits somewhere downstream of those two breaks in the timeline. (proceedings.neurips.cc, research.google)
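
For readers who want to see the attention idea in concrete terms, here is a minimal sketch in plain Python of scaled dot-product attention, the scoring step at the core of the transformer. It is an illustration under simplifying assumptions, not the paper's implementation: the three words and their two-number vectors are made up, and the same vectors stand in for queries, keys, and values, whereas a real transformer learns separate projections for each.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Returns (outputs, weights); weights[i][j] is how much position i
    draws on position j.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Score this position's query against every key in the sequence.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        row = softmax(scores)
        # Blend the value vectors using the attention weights.
        mixed = [
            sum(w * v[d] for w, v in zip(row, values))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
        weights.append(row)
    return outputs, weights


# Hypothetical 2-dimensional vectors for a three-word sequence.
tokens = ["the", "cat", "sat"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

# Assumption: reuse the same vectors as queries, keys, and values.
outputs, weights = attention(vectors, vectors, vectors)
for word, row in zip(tokens, weights):
    print(word, [round(w, 2) for w in row])
```

Each pass through the loop depends only on its own query and the full set of keys and values, never on the result for the previous word. That independence is what lets real implementations compute every position at once as a single matrix operation instead of stepping through the sequence.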