Sub-1B MoE shows strong gains
- Zyphra launched ZAYA1-8B on May 6, an 8.4B-parameter MoE with just 760M active parameters, claiming reasoning and coding results against far larger models.
- The strongest datapoint is density: ZAYA1-8B posts 89.1 on AIME’26 and 65.8 on LiveCodeBench-v6 while activating under 1 billion parameters.
- The bigger shift is serving economics: compressed attention and KV-cache tricks are moving memory bandwidth, not weights alone, to center stage.

Small MoE models are having a real moment. The headline this week is Zyphra’s ZAYA1-8B, released May 6, which packs 8.4B total parameters but only 760M active ones per token, and still lands benchmark numbers that normally belong to much bigger systems. That matters because the old mental model was simple: if you wanted stronger reasoning, you paid with more active parameters and more expensive serving. Turns out the new bottleneck is often memory movement, especially attention and KV cache, not just raw model size. (zyphra.com)

### What does “sub-1B active” actually mean?

A mixture-of-experts model keeps many expert blocks around, but only routes each token through a small subset of them. So ZAYA1-8B is not a 760M-parameter model in the usual dense sense: it is an 8.4B MoE whose active path is 760M. That distinction is the whole trick. You get a larger total capacity without paying the full per-token compute bill every time. (huggingface.co)

### Why is this week’s release getting attention?

Because the benchmark spread is hard to ignore. Zyphra’s model card shows 89.1 on AIME’26, 71.6 on HMMT Feb. ’26, 65.8 on LiveCodeBench-v6, and 71.0 on GPQA-Diamond. In the same table, it is in range of or ahead of several much larger open models, including systems with 3B, 6B, or more active parameters. Vendor benchmarks always need healthy skepticism, but the parameter story is unusually strong. (huggingface.co)

### What changed under the hood?

Part of the answer is attention. Zyphra says ZAYA1-8B uses Compressed Convolutional Attention, or CCA, which pushes attention work into a compressed latent space to cut compute, memory, and parameter cost. Basically, the model is not just “small and sparse.” It is also built to make the expensive part of long-context inference cheaper. That matters more than it used to. (zyphra.com)

### Why does KV cache keep coming up?
Because MoE saves compute in the expert layers, but the KV cache from attention can still stay dense. vLLM’s April writeup makes the serving problem plain: at 128k-plus contexts, KV cache often dominates GPU memory, and every decode step has to read a large chunk of it. Their FP8 path cuts KV-cache storage roughly in half and can reduce per-token memory-bound decode cost to 54%. That is a serving optimization, not a training win, but it directly changes deployment math. (vllm-project.github.io)

### So is this only about quantization?

No. There are two separate moves happening. One is compressing the cache or attention itself. The other is getting smarter about where cache lives and how it is shared. A recent paper, MoE-nD, shows per-layer routing over different compression choices can preserve long-context quality far better than one-size-fits-all compression. Treating the whole cache the same way is leaving performance on the table. (arxiv.org)

### What about agent systems and distributed serving?

That is the other half of the story. ForkKV, posted in April, targets multi-agent and multi-LoRA serving where many agents share a huge common context but diverge slightly. Its copy-on-write design splits KV cache into a shared chunk plus lightweight agent-specific deltas, and the paper reports up to 3.0× throughput over prior systems. Different setup, same lesson. (arxiv.org)

### What should engineers take from this?

Stop sizing deployments from parameter count alone. For modern MoE systems, the real constraint is often active parameters plus attention design plus KV-cache policy plus interconnect overhead. A model that looks “bigger” on paper can be cheaper to run than a smaller dense model if it activates less compute and moves less memory. (zyphra.com)

The takeaway is not just that one startup shipped a punchy small MoE.
It is that sub-1B active models now look materially more credible — and the reason is full-stack design, from routing to attention to cache handling, not one magic benchmark. (zyphra.com)
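
To make the sizing point concrete, here is a rough back-of-the-envelope sketch in Python. It uses the standard dense KV-cache formula (keys plus values, per layer, per KV head, per token) and compares it with the weight footprint of an 8.4B-parameter model. The attention geometry below (`num_layers`, `num_kv_heads`, `head_dim`) is a hypothetical placeholder, not ZAYA1-8B's actual configuration, and the formula does not model compressed schemes such as CCA. Only the 8.4B total, the 128k context, and the FP8-versus-16-bit comparison come from the article.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class AttentionShape:
    """Hypothetical attention geometry; NOT any real model's configuration."""
    num_layers: int
    num_kv_heads: int
    head_dim: int


def kv_cache_bytes(shape: AttentionShape, context_tokens: int, bytes_per_value: int) -> int:
    """KV-cache bytes for one sequence under plain (uncompressed) attention.

    The factor of 2 accounts for storing both keys and values.
    """
    per_token = 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * bytes_per_value
    return per_token * context_tokens


def weight_bytes(total_params: float, bytes_per_param: int) -> int:
    """Weight footprint, assuming every expert stays resident in memory."""
    return int(total_params * bytes_per_param)


# Assumed values for illustration only.
shape = AttentionShape(num_layers=32, num_kv_heads=8, head_dim=128)
context = 128 * 1024  # a 128k-token context, as in the vLLM discussion

kv_16bit = kv_cache_bytes(shape, context, bytes_per_value=2)  # BF16/FP16 cache
kv_fp8 = kv_cache_bytes(shape, context, bytes_per_value=1)    # FP8 cache
weights = weight_bytes(8.4e9, bytes_per_param=2)              # 8.4B total params at 16-bit

print(f"weights (16-bit):       {weights / GIB:.1f} GiB")
print(f"KV cache, 16-bit, 128k: {kv_16bit / GIB:.1f} GiB per sequence")
print(f"KV cache, FP8, 128k:    {kv_fp8 / GIB:.1f} GiB per sequence")
```

Under these assumed numbers, a single 128k-token sequence needs 16 GiB of 16-bit KV cache, about as much as the full set of weights, and FP8 storage cuts that to 8 GiB. Real models will differ, but the shape of the result is why cache policy belongs in the sizing conversation alongside parameter counts.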