SubQ launches 12M‑token sparse LLM
- Subquadratic came out of stealth on May 5 and launched SubQ 1M-Preview, a long-context model plus API, coding agent, and search tool.
- The eye-catching claim is architectural: 12M-token research context, 52.2× faster prefill at 1M tokens, and nearly 1,000× less attention compute at 12M.
- If it holds up, long-context AI gets cheaper, and some RAG-heavy workflows start looking like temporary hacks.

Long-context AI is the part of the model market that keeps promising magic and then handing developers a pile of retrieval hacks. You want the model to read the whole repo, the whole case file, or the whole months-long agent history. Instead, you chunk, rank, summarize, and pray the right passages make it into the prompt. Subquadratic’s launch on May 5 is interesting because it claims to attack that bottleneck at the architecture level, not with another wrapper. The company says its new model, SubQ 1M-Preview, is built on fully subquadratic sparse attention, with a 12 million-token research context and a 1 million-token product preview. (subq.ai)

### What actually launched?

Subquadratic launched several things at once: the SubQ 1M-Preview model, an API in private beta, a command-line coding product called SubQ Code, and a search product called SubQ Search. The company also came out of stealth with $29 million in seed funding. The pitch is simple: stop treating giant context windows as a luxury feature and make them usable enough for real workflows. (subq.ai)

### What is the core claim?

The core claim is that SubQ breaks the usual transformer cost curve. Standard attention compares every token with every other token, so cost rises quadratically as context gets longer: double the prompt and the attention work roughly quadruples. Subquadratic says its SSA system routes attention only to the positions that matter, which makes compute grow linearly with context length instead. That is the whole […] practice, million-token prompts stop being a stunt. (subq.ai)

### Why does 12 million tokens matter?

Because 12 million tokens is not “a slightly bigger context window.” It is big enough to fit things people currently break into pieces: entire codebases, long pull-request histories, and persistent agent state.

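To see why the quadratic-versus-linear distinction matters at these sizes, here is a back-of-envelope sketch in Python. It only counts token-pair comparisons. The per-token budget `k` is a hypothetical parameter chosen for illustration, not a figure Subquadratic has published, and the model ignores constants, memory traffic, and any routing overhead.

```python
def dense_pairs(n: int) -> int:
    """Token pairs scored by standard attention: every token against every token."""
    return n * n


def sparse_pairs(n: int, k: int) -> int:
    """Token pairs scored if each token attends to at most k positions (assumed scheme)."""
    return n * min(k, n)


def reduction(n: int, k: int) -> float:
    """How many times fewer pairs the sparse scheme scores than the dense one."""
    return dense_pairs(n) / sparse_pairs(n, k)


if __name__ == "__main__":
    k = 12_000  # assumed per-token budget, picked only for illustration
    for n in (128_000, 1_000_000, 12_000_000):
        print(
            f"{n:>12,} tokens: dense={dense_pairs(n):.2e} "
            f"sparse={sparse_pairs(n, k):.2e} "
            f"reduction={reduction(n, k):,.0f}x"
        )
```

With a fixed budget the reduction factor is simply `n / k`, so it grows with context length. This arithmetic does not describe how SSA works or reproduce the company's numbers; it only shows why a linear-cost scheme pulls further ahead of dense attention as prompts get longer.
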
Subquadratic’s own examples put the Python 3.13 standard library at about 5.1 million tokens and six months of React pull requests […] selling the idea that the model can keep the whole working set in view instead of constantly reloading fragments. (subq.ai)

### What numbers made people look up?

Three numbers. First, Subquadratic says SSA hits a 52.2× prefill speedup over dense attention at 1 million tokens. Second, it says attention compute drops by almost 1,000× at 12 million tokens. Third, the company markets the system at about one-fifth the cost of other leading LLMs, with the homepage listing 150 tokens per second. Those are huge claims […] the economics if true. (subq.ai)

### Does it look competitive on quality?

On the benchmarks Subquadratic chose, it looks more serious than a pure efficiency demo. The company lists 95.0% on RULER at 128K, 65.9% on MRCR v2 with eight needles at 1 million tokens, and 81.8% on SWE-Bench Verified. That mix matters because the usual failure mode for exotic attention schemes is obvious: you get […] it. Subquadratic is arguing that tradeoff is not mandatory anymore. (subq.ai)

### So why are researchers skeptical?

Because this field is littered with “linear-time attention” ideas that looked great in theory and then lost quality, broke at scale, or depended on selective benchmarks. Even the company’s friendliest coverage lands on the same caveat: independent validation is still thin, and the technical report was not yet available at launch beyond high-level explanations. […] claim is plausible enough to watch, but not settled enough to treat as fact. (subq.ai)

### What changes if SubQ is real?

The biggest shift is not just bigger prompts. It is less need for brittle retrieval plumbing. A lot of current “agent” design is really compensation for context scarcity: vector search, chunking, memory summaries, rerankers, and orchestration layers built to squeeze around transformer limits.

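For readers who have not built one, the retrieval plumbing described above can be sketched in a few lines of Python. This is a deliberately toy version: word-overlap scoring stands in for vector search and rerankers, whitespace-separated words stand in for tokens, and every function name here is hypothetical rather than taken from any real library or from Subquadratic's products.

```python
def chunk(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def score(query: str, passage: str) -> int:
    """Toy relevance score: number of distinct query words found in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def build_prompt(query: str, corpus: str, chunk_size: int, budget: int) -> list[str]:
    """Keep the highest-scoring chunks that fit within a word budget."""
    ranked = sorted(chunk(corpus, chunk_size), key=lambda c: score(query, c), reverse=True)
    kept, used = [], 0
    for passage in ranked:
        length = len(passage.split())
        if used + length > budget:
            continue  # does not fit; the model never sees this chunk
        kept.append(passage)
        used += length
    return kept


if __name__ == "__main__":
    corpus = (
        "parser reads tokens from the lexer "
        "cache stores compiled regex patterns once "
        "scheduler retries failed jobs with backoff"
    )
    query = "why does the scheduler retry failed jobs"
    print(build_prompt(query, corpus, chunk_size=6, budget=6))
```

Whatever the ranking step drops never reaches the model, so a scoring mistake silently becomes a missing fact. That is the brittleness a large, affordable context window would sidestep.
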
If SubQ’s architecture survives […] around a temporary hardware-tax problem. (subq.ai)

### Bottom line?

Subquadratic did not just launch a model. It launched a challenge to the assumption that frontier LLMs have to stay transformer-shaped forever. But the catch is obvious: until outsiders reproduce the speed, cost, and quality claims, this is a very interesting architectural bet, not a closed case. (subq.ai)