AMD HyLo extends context to 2M

- AMD researchers posted HyLo, a post-training recipe that converts small Llama and Qwen checkpoints into hybrid long-context models reaching 2M tokens.
- The headline numbers are 32× longer usable context and more than 90% less KV-cache memory, with 1B–3B variants beating prior upcycled baselines.
- It matters because long-context inference is usually a memory bill first, and HyLo targets cheaper open models instead of giant frontier systems.

Long-context language models usually hit a boring limit before they hit an intelligence limit: memory. The model may understand more text, but the KV cache balloons so fast that serving it gets expensive or just impossible. That is the gap AMD is trying to close with HyLo, a new recipe for turning existing small open models into much longer-context hybrids. The paper hit arXiv on April 27, 2026, and the claim is simple but punchy: up to 2M-token prefill and decoding, with more than 90% less KV-cache memory than comparable Transformer baselines. (arxiv.org)

### What is HyLo actually changing?

HyLo is not a brand-new model trained from scratch. It is an “upcycling” method: AMD takes pretrained Transformer checkpoints and swaps parts of the architecture for a hybrid mix of Multi-Head Latent Attention plus linear sequence blocks like Mamba2 or Gated DeltaNet, then does staged long-context post-training and distillation so the model does not fall apart. That matters because training models from zero is brutally expensive. (arxiv.org)

### Why is the KV cache the real bottleneck?

When a model reads a long prompt, it stores attention keys and values for all those prior tokens. That cache grows linearly with context length, so even a modest model can become a memory hog at 128K, 256K, or beyond. HyLo’s pitch is that if you shrink that memory footprint by more than 90%, you do not just get a nicer benchmark line; you make million-token inference practical […]. AMD says comparable Llama baselines run out of memory beyond 64K context in its vLLM stack. (arxiv.org)

### Why 2M tokens sounds bigger than it is

Two million tokens is enormous: roughly book-stack territory, not chatbot-history territory. But the key phrase is “usable context length.” A lot of long-context announcements are really tokenizer math, synthetic retrieval tricks, or settings that are too slow and too memory-heavy to serve economically.
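To put that memory pressure in rough numbers, here is a minimal Python sketch of the textbook KV-cache size formula for a standard Transformer. The configuration values (16 layers, 8 KV heads, head dimension 64, 16-bit cache entries) are illustrative assumptions for a 1B-class model with grouped-query attention, not figures from the HyLo paper. The flat 10% `remaining_fraction` is only a stand-in for the “more than 90% less” headline, not a description of how a hybrid model actually allocates memory.

```python
GIB = 1024 ** 3

# Hypothetical 1B-class configuration (assumed for illustration, not from the paper).
CONFIG = dict(n_layers=16, n_kv_heads=8, head_dim=64)


def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence in a standard Transformer."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


def report(context_lengths, remaining_fraction=0.10):
    """Return (tokens, full cache in GiB, reduced cache in GiB) for each context length."""
    rows = []
    for n in context_lengths:
        full = kv_cache_bytes(n, **CONFIG)
        # Assumption: model "more than 90% less" as at most 10% of the full cache.
        rows.append((n, full / GIB, full * remaining_fraction / GIB))
    return rows


if __name__ == "__main__":
    for n, full_gib, reduced_gib in report([64 * 1024, 256 * 1024, 2 * 1024 * 1024]):
        print(f"{n:>9,} tokens: {full_gib:6.1f} GiB full cache, "
              f"{reduced_gib:5.1f} GiB at 10% of that")
```

Under these assumed values, the full cache works out to 2 GiB at 64K tokens and 64 GiB at 2M tokens for a single sequence, before counting model weights, activations, or batching.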
HyLo is more interesting because the paper ties the long window to actual prefill and decoding, rather than just saying the model was exposed to long sequences in training. (arxiv.org)

### Which models did AMD test?

The paper focuses on 1B- and 3B-scale variants built from Llama- and Qwen-based checkpoints. That is a deliberate choice. AMD is not trying to win the “largest parameter count” contest here. It is aiming at the part of the market where open models are small enough to deploy widely, but still useful for document-heavy assistants, coding tools, retrieval systems, and persistent-memory agents. (arxiv.org)

### Does the quality hold up?

That is the obvious catch with any efficiency trick: did the model get cheaper by getting worse? AMD says no, at least not in the headline results. The paper says HyLo keeps strong short- and long-context performance, beats prior upcycled hybrid baselines on RULER, and highlights one especially aggressive comparison: HyLo-Qwen-1.7B trained on 10B tokens outperforming JetNemotron, which […] commonsense evals, and RULER-64K. That does not make it a universal winner, but it is a strong signal that the efficiency gains may not come at the cost of quality. (arxiv.org)

### Why does AMD care about small hybrids?

Because this fits a bigger AMD pattern. Last year the company was already pushing HybridLM work built from MLA and Mamba2 blocks, framed around lower memory use and faster inference on AMD hardware. HyLo looks like the long-context extension of that same strategy: not “build the smartest giant model,” but “make practical open models much cheaper to run.” (rocm.blogs.amd.com)

### So what changes if this holds up?

The immediate win is cost. Long-document apps, agent memory, codebase analysis, and retrieval-heavy workflows all get constrained by context-memory economics before they get constrained by raw model IQ.
If HyLo’s numbers survive wider testing, the interesting shift is not that every app suddenly needs 2M tokens. It is that architecture surgery after pretraining can beat brute-force scaling for a lot of real deployments. If that bet lands, the long-context race stops being only about who can advertise the biggest window and starts being about who can afford to use it.
