SubQ demos 12M-token context speedup

- AI startup Subquadratic launched SubQ 1M-Preview on May 5 and said its new “fully subquadratic” model can stretch to 12 million tokens.
- The headline claim is economics: 52× faster than FlashAttention at 1 million tokens, under 5% of Opus cost, and 95% on the RULER 128K benchmark.
- If those numbers hold outside company demos, long-context AI gets cheap enough to replace a lot of retrieval and chunking hacks.

Long-context AI is the part of the model market that keeps promising magic and then handing you plumbing. You can give a model more text, more code, more documents, but the cost usually explodes, and the model often gets worse at using the extra room. That is the gap Subquadratic is trying to hit. On May 5, the startup launched SubQ 1M-Preview and said its architecture can scale to a 12 million-token research context while keeping compute growth linear, not quadratic. (subq.ai)

### What is the thing they actually built?

SubQ is a language model built around what the company calls a “fully subquadratic” architecture. In plain English, that means the expensive part of attention is not supposed to blow up the usual way as context gets longer. The launch product is SubQ 1M-Preview, plus a private-beta API, a coding agent called SubQ Code, and [...]: stop treating million-token context as a stunt and make it usable in normal software. (subq.ai)

### Why is long context so hard?

Standard transformer attention compares every token with every other token. That pairwise pattern is why compute rises quadratically with sequence length. FlashAttention improved the practical speed and memory behavior of this setup by being smarter about GPU memory traffic, but it still sits inside the same quadratic attention world. [...] faster again, yet the core scaling limit remained. (arxiv.org)

### So what changed here?

Subquadratic says it moved past “make quadratic attention less painful” and changed the scaling law itself. The company claims compute grows linearly with context length, and it says a research run reached 12 million tokens. It also says that at 1 million tokens, SubQ is 52× faster than FlashAttention, while attention compute drops by almost [...] million tokens. Those are company numbers, so the right posture is interest, not surrender. But if they survive outside testing, this is a real architecture story, not just a kernel optimization story. (subq.ai)

### What about model quality?

This is the usual catch with efficient attention schemes: they often get cheaper by getting worse. Subquadratic is leaning hard against that history. It says SubQ 1M-Preview hit 95% on the RULER 128K benchmark in a third-party-verified run, alongside strong needle-in-a-haystack and exact-copy results. RULER matters because it was built [...] across 13 synthetic tasks, not just simple retrieval. (subq.ai)

### Why compare it to Opus?

Because people do not buy “attention mechanisms.” They buy outcomes: latency, cost, and whether a model can hold a giant working set in memory. Subquadratic says SubQ runs at under 5% of Opus cost for these long-context jobs. That frames the launch less as a research curiosity and more as a product attack on expensive frontier-model work [...], and document-heavy enterprise tools. (subq.ai)

### Does this kill RAG?

Not exactly. Retrieval-augmented generation exists because shoving an entire corpus into a transformer is usually too expensive and too brittle. Subquadratic’s own launch post basically says the industry built chunking, retrieval, and prompt-routing workarounds around that limitation. Cheaper million-token context does not erase retrieval, but [...] just load the whole repo, the whole case file, or the whole project history and reason in one pass. (subq.ai)

### What should you watch next?

Independent replication. That is the whole ballgame. The important question is not whether a startup can post a benchmark chart. It is whether outside developers can reproduce the speedups, whether quality holds on messy real tasks, and whether the economics still look good once serving overhead shows up. If yes, long-context AI stops [...] default infrastructure. (subq.ai)

### Bottom line

SubQ matters because it is making a bigger claim than “we optimized transformers.” It is saying the cost curve for long context can bend. If that turns out to be true, a lot of today’s AI stack starts to look like a temporary workaround.
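
For a rough sense of the scaling gap described under “Why is long context so hard?”, here is a minimal Python sketch that only counts token-to-token comparisons. It does not model SubQ's actual mechanism or any real kernel. The function names are ours, and `per_token_budget` is a hypothetical fixed number of comparisons per token, chosen purely for illustration.

```python
def quadratic_comparisons(n_tokens: int) -> int:
    """Full attention: every token is compared with every token."""
    return n_tokens * n_tokens


def linear_comparisons(n_tokens: int, per_token_budget: int = 1024) -> int:
    """Hypothetical linear scheme: each token gets a fixed comparison budget.

    per_token_budget is an illustrative assumption, not a SubQ parameter.
    """
    return n_tokens * per_token_budget


def growth_factor(cost_fn, short_ctx: int, long_ctx: int) -> float:
    """How many times more work the longer context needs under cost_fn."""
    return cost_fn(long_ctx) / cost_fn(short_ctx)


if __name__ == "__main__":
    one_m, twelve_m = 1_000_000, 12_000_000
    # 12x more tokens means 12 * 12 = 144x more pairwise comparisons.
    print(growth_factor(quadratic_comparisons, one_m, twelve_m))  # 144.0
    # Under a linear cost curve, 12x more tokens means 12x more work.
    print(growth_factor(linear_comparisons, one_m, twelve_m))     # 12.0
```

The point is the shape of the curve, not the constants. A faster kernel lowers the cost of each comparison, while a linear scheme changes how the number of comparisons grows with context length.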
