SubQ shows 12M-token sparse model

- Subquadratic launched private beta access to SubQ 1M-Preview on May 5, alongside SubQ Code and SubQ Search, after showing a 12 million-token research run.
- The company says its fully subquadratic architecture cuts attention compute by nearly 1,000× at 12 million tokens, with 92% RULER 128K accuracy.
- If that holds outside demos, long-context apps could shift from retrieval-heavy hacks to just loading the whole corpus.

Long-context language models are supposed to let you dump in a whole codebase, a legal archive, or a giant research pile and just ask questions. The problem is that standard transformers get brutally expensive as context grows. Cost rises fast, latency rises fast, and the model often still misses the thing you care about. That is the gap Subquadratic is trying to attack, and this week the company opened private beta access to its first model after showing a 12 million-token research result. (subq.ai)

### What is the actual news?

Subquadratic said on May 5 that it is launching early access to three products: SubQ 1M-Preview through an API, SubQ Code as a CLI coding agent, and SubQ Search as a long-context research tool. The company’s pitch is simple: this is the first LLM built on a fully subquadratic architecture, meaning compute is meant to grow linearly with context length rather than quadratically. (subq.ai)

### Why is quadratic attention such a problem?

In a normal transformer, every token gets compared with every other token. That sounds fine at 8K or 32K tokens. It gets ugly at 1M. The number of interactions explodes, which means memory pressure, slower inference, and much higher cost. That is why so many “long-context” systems quietly lean on retrieval, chunking, and prompt tricks instead of truly reasoning over everything at once. (subq.ai)

### What is SubQ claiming it changed?

SubQ says it rebuilt the architecture so compute grows linearly with context length. That is the big claim: not just a faster kernel, not just a sparse add-on, but a model family designed around subquadratic scaling from the start. In the company’s launch materials, it says the architecture reduces attention compute by almost 1,000× in a 12 million-token re[…]odels. (subq.ai)

### Does that mean the model can really use 12 million tokens?

Sort of, but this is where you should separate “fits” from “uses well.” A model can technically accept a giant window and still fail to retrieve the right fact buried inside it. SubQ is leaning hard on that exact criticism. It says longer inputs do not automatically help if the model becomes less consistent about finding what matter[…]ence that its approach is not just cheaper but still accurate. (subq.ai)

### What benchmarks matter here?

The company says SubQ 1M-Preview hit 92% on RULER 128K, a common long-context benchmark, with third-party verification. Its homepage also summarizes the product as handling 12M tokens while being 50× cheaper than leading frontier models. Those are strong numbers, but they are still company-presented numbers, so the real test is whether outside developers see the […]orkloads. (subq.ai)

### Why launch code and search first?

Because those are the clearest use cases for giant context. A coding agent benefits if it can ingest an entire repository in one pass instead of juggling chunks and handoffs between tools. A research product benefits if it can search across a huge pile of documents without building a brittle retrieval stack first. Basically, these are the places where long c[…]duct story. (subq.ai)

### What is the catch?

The catch is that sparse or subquadratic attention has looked promising for years, but practical tradeoffs usually show up somewhere else: accuracy, training complexity, hardware friendliness, or weird failure modes on real tasks. SubQ’s announcement is interesting because it claims to improve context length, speed, and cost together. But until more independent evaluation[…] validation. (subq.ai)

### Bottom line

This is one of the more ambitious long-context announcements in a while. If SubQ can really make million-token reasoning cheap enough and reliable enough, a lot of today’s retrieval-heavy AI plumbing starts to look like a workaround, not a permanent architecture. (subq.ai)
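
For a rough sense of why quadratic attention gets ugly at long context, here is a back-of-envelope sketch in Python. It only counts token-to-token comparisons: the quadratic case assumes every token attends to every other token, and the linear case assumes each token attends to a fixed number of positions. The function names and the `budget` parameter are made up for illustration. This is not a description of how SubQ works, and the budget value is not a figure the company has published.

```python
# Toy comparison of quadratic vs. linear attention cost.
# Everything here is illustrative: `budget` is a hypothetical per-token
# attention budget, not a number published by Subquadratic.

DEFAULT_BUDGET = 4_096  # assumed positions each token attends to in the linear case


def quadratic_interactions(n_tokens: int) -> int:
    """Comparisons when every token is compared with every token."""
    return n_tokens * n_tokens


def linear_interactions(n_tokens: int, budget: int = DEFAULT_BUDGET) -> int:
    """Comparisons when each token attends to at most `budget` positions."""
    return n_tokens * min(budget, n_tokens)


def savings_ratio(n_tokens: int, budget: int = DEFAULT_BUDGET) -> float:
    """How many times fewer comparisons the linear scheme needs."""
    return quadratic_interactions(n_tokens) / linear_interactions(n_tokens, budget)


if __name__ == "__main__":
    for n in (8_000, 32_000, 128_000, 1_000_000, 12_000_000):
        print(
            f"{n:>12,} tokens: "
            f"quadratic={quadratic_interactions(n):.2e}  "
            f"linear={linear_interactions(n):.2e}  "
            f"ratio={savings_ratio(n):,.1f}x"
        )
```

In this toy model the savings ratio is just the context length divided by the budget, so it keeps growing as the window gets longer. That is the general shape of the argument for subquadratic designs. It says nothing about accuracy, which is the part that still needs outside validation.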
