SubQ LLM hits 12M token context

- Miami startup Subquadratic came out of stealth on May 5 and unveiled SubQ, a long-context model it says can read 12 million tokens at once.
- The headline claim is speed: 52× faster than dense FlashAttention-style attention at 1 million tokens, with 92.1% retrieval at 12 million. (thenewstack.io)
- If the numbers hold up, long-context AI stops being a retrieval-workaround problem and starts looking like a model architecture shift. (thenewstack.io)

A long-context model is only as useful as the attention mechanism underneath it. That’s the real story here. On May 5, Subquadratic, a Miami startup that launched with $29 million in seed funding, said its new model, SubQ, can handle a 12 million-token context window without the usual quadratic […] (thenewstack.io)

[…] spend huge amounts of time and money just looking across the prompt. (siliconangle.com)

Transformer attention usually compares each token with every other token. That’s the famous quadratic problem. Double the input length and the work roughly quadruples. It’s why “just give the model more context” stops being practical fast, even when the model technically supports it. (siliconangle.com)

### What is SubQ claiming to change?

Subquadratic […] is simple: don’t compute full attention everywhere. Let the model focus exact attention on the parts of the sequence that matter, instead of paying the full all-to-all cost across millions of tokens. The company says that changes the scaling from the usual quadratic pattern to something much closer to linear in practice. (thenewstack.io)

### Why does “12 million tokens” sound so wild?

Because it is far beyond what most frontier APIs expose today. The current top end for mainstream cloud models is around 1 million tokens, and even that is expensive enough that most developers still use retrieval, chunking, and agent pipelines to avoid stuffing everything into one prompt. SubQ’s pitch is that you can keep much more of the original material in view (entire codebases, long work histories, or very large research corpora) without building so many workaround layers around the model. (thenewstack.io)

### Are the benchmark numbers actually big?

Yes, on paper. The company says SubQ runs 52× faster than dense attention at 1 million tokens, scores 92.1% on needle-in-a-haystack retrieval at 12 million tokens, and reaches 83 on MRCR v2.
It also says the model hit 82.4% on SWE-bench, edging past recent Anthropic and Google numbers cited in launch coverage. Those are attention-grabbing results because they combine long context with capability claims, not just a raw window-size stunt. (thenewstack.io)

### […] this already?

Absolutely. This is a crowded research lane. Systems like MInference and newer token-sparse methods already try to skip low-value attention work so long prompts become cheaper to process. But most of those methods are inference optimizations layered onto existing transformers. SubQ is pitching something bigger: not just a faster kernel, but a different frontier-model architecture built around sparse attention from the start. (arxiv.org)

[…] numbers are coming from the company and coverage around the launch. That doesn’t mean they’re wrong. But it does mean the real test is whether outside researchers and developers can reproduce the speedups, quality, and scaling behavior on messy workloads, especially code, multi-document reasoning, and agent loops where “everything matters” more than a retrieval benchmark suggests. (thenewstack.io)

### […]

Because a lot of today’s AI product design is basically a coping strategy for quadratic attention. Retrieval-augmented generation, summarization stacks, and multi-step agent decomposition all exist partly because feeding the whole problem to the model is too expensive. If SubQ-like architectures really make very long prompts cheap and accurate, some of that scaffolding gets simpler, and some products get redesigned around persistent, always-available context instead. (thenewstack.io)

### Bottom line?

The interesting part isn’t that one model hit 12 million tokens. It’s that a startup is claiming the transformer’s most annoying scaling law is no longer the main constraint. If that survives outside testing, long-context AI stops being a bag of hacks and starts becoming the default way these systems work. (thenewstack.io)
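
For readers who want the scaling argument above in concrete terms (double the input, roughly quadruple the work), here is a minimal Python sketch that counts token-pair comparisons. It is only a back-of-the-envelope model: the function names and the fixed per-token `budget` are hypothetical, chosen for illustration, and nothing here describes SubQ’s actual mechanism, which the coverage does not spell out.

```python
def dense_attention_pairs(n_tokens: int) -> int:
    """Token-pair comparisons when every token attends to every token."""
    return n_tokens * n_tokens


def sparse_attention_pairs(n_tokens: int, budget: int) -> int:
    """Comparisons when each token attends to at most `budget` tokens.

    `budget` is a hypothetical fixed per-token limit used for illustration;
    it is not a published SubQ parameter.
    """
    return n_tokens * min(budget, n_tokens)


if __name__ == "__main__":
    budget = 4_096  # assumed value, for illustration only
    for n in (1_000_000, 2_000_000, 12_000_000):
        dense = dense_attention_pairs(n)
        sparse = sparse_attention_pairs(n, budget)
        print(
            f"{n:>12,} tokens: dense={dense:.2e} "
            f"sparse={sparse:.2e} ratio={dense / sparse:,.0f}x"
        )
```

Going from 1 million to 2 million tokens quadruples the dense count but only doubles the budgeted one, which is the sense in which a sparse scheme can look “closer to linear.” The ratios printed are raw pair counts under the assumed budget; they are not comparable to the company’s reported 52× speedup, which would also depend on kernels, memory traffic, and how the attended tokens are selected.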
