Lottery ticket revived

Researchers have revived the lottery-ticket idea for modern GPUs by using structured sparsity to prune huge models, reportedly removing up to 90% of parameters while keeping accuracy intact. On NVIDIA Ampere and newer hardware, early reports put the gain at roughly 2× throughput and about 50% less memory use, which could materially change how big models are deployed.

A giant language model is mostly blank space in disguise. The surprise is that researchers keep finding you can delete huge chunks of it and still get almost the same answers, if you delete the right chunks. (arxiv.org)

That idea is called the lottery ticket hypothesis. Jonathan Frankle and Michael Carbin’s 2018 paper argued that a big neural network contains much smaller subnetworks, or “winning tickets,” that can train to the same accuracy as the full model. (arxiv.org)

The catch was hardware. Early winning tickets usually used unstructured sparsity, which means zeroes are scattered like random holes in Swiss cheese, and graphics processors are bad at skipping random holes fast enough to matter. (proceedings.mlr.press)

Chip makers answered with structured sparsity. NVIDIA’s Ampere and Hopper graphics processors support a fixed “2 out of 4” pattern, which means that in every group of four weights, at least two must be zero. (developer.nvidia.com) That pattern is much easier for hardware to exploit. NVIDIA says those sparse Tensor Cores can potentially double matrix-multiplication throughput, because the chip only has to process the nonzero half of each 2:4 block. (developer.nvidia.com)

But the same pattern also made pruning harder. A 2024 paper called Weight Recover Prune found that forcing large language models into 2:4 sparsity causes a noticeable accuracy drop, because the group rule throws away useful weights along with useless ones. (aclanthology.org)

That is why this week’s reports are getting attention. The claim is that researchers have found a way to bring the lottery-ticket idea back in a hardware-friendly form, so the sparse subnetwork is not just smaller on paper but faster on today’s NVIDIA cards. (x.com) The key shift is from “find any sparse subnetwork” to “find a sparse subnetwork the chip already knows how to accelerate.” That closes the old gap between pruning research and real deployment, where a model could be 90 percent pruned in a paper and still run like a dense model in production. (proceedings.mlr.press)

This is not the first hint that structured winning tickets can work. A 2022 International Conference on Machine Learning paper reported structurally sparse lottery tickets with up to 64.93 percent runtime savings at tested sparsity levels while keeping accuracy comparable to dense models. (proceedings.mlr.press)

If the new results hold up at large language model scale, the practical math changes fast. A model that keeps accuracy after heavy pruning can cut memory traffic, fit on fewer graphics processors, and serve more users per machine than the dense version. (developer.nvidia.com)

There is still a reason to be careful. The strongest public, primary-source evidence I could verify today shows that 2:4 structured sparsity is real on NVIDIA hardware and that prior papers have partly closed the lottery-ticket hardware gap, but the exact new headline numbers in the posts are still early reports rather than a fully vetted paper I could cite directly. (developer.nvidia.com, proceedings.mlr.press, x.com)
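For readers who want to see what the "2 out of 4" rule looks like in practice, here is a minimal Python sketch using NumPy. It is not the method from this week's reports or from any of the cited papers: it uses plain magnitude pruning (drop the two smallest weights in each group of four) purely to illustrate the pattern. The function names `prune_2_of_4` and `compressed_fraction` are made up for this example, and the storage estimate assumes 16-bit weights plus 2 bits of position metadata per kept weight.

```python
import numpy as np


def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude weights in every group of four.

    Groups are taken along the last axis. This is a simple magnitude
    heuristic chosen for illustration, not a published pruning method.
    """
    if weights.shape[-1] % 4 != 0:
        raise ValueError("last dimension must be a multiple of 4")
    groups = weights.reshape(-1, 4)
    # Column indices of the two smallest |w| in each group of four.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    pruned = groups.copy()  # leave the caller's array untouched
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(weights.shape)


def compressed_fraction(value_bits: int = 16, index_bits: int = 2) -> float:
    """Storage for one 2:4 group relative to the dense group.

    Assumption: each kept weight stores its value plus a small index
    saying which of the four slots it came from.
    """
    dense_bits = 4 * value_bits
    sparse_bits = 2 * (value_bits + index_bits)
    return sparse_bits / dense_bits


rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))          # hypothetical weight matrix
w_sparse = prune_2_of_4(w)

# Every group of four now holds at most two nonzero weights.
nonzeros_per_group = np.count_nonzero(w_sparse.reshape(-1, 4), axis=1)
print(nonzeros_per_group.max())
print(compressed_fraction())          # 0.5625 under the stated assumptions
```

Under those assumptions the compressed weights take a little over half the dense storage, which is in the same range as the "about 50% less memory" figure in the early reports. The sketch also shows why accuracy can suffer: the rule drops two weights from every group, even a group in which all four matter.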
