Hybrid LLMs: The Next Architecture?

A new hybrid AI model called Olmo Hybrid, which blends RNN and attention layers, was published on March 5. The architecture is reportedly about 2x more efficient in pre-training, prompting speculation that frontier models from OpenAI and Anthropic may already use similar RNN-based designs. However, support in open-source tools like vLLM is still immature, with stable implementations estimated to be 3-6 months away.

The architectural shift away from pure transformers has been driven by research from institutions such as the Allen Institute for AI, the creators of Olmo. Their work showed that replacing 75% of the attention layers with Gated DeltaNet layers, a type of recurrent neural network (RNN), lets the model match the performance of its predecessor, Olmo 3, using roughly half the training tokens. The hybrid design repeats a pattern of three DeltaNet layers followed by one traditional multi-head attention layer, and it reportedly improves inference efficiency on long-context tasks by 75%.

The trend is not isolated. Other prominent models have adopted similar hybrid approaches. Qwen 3.5 combines Gated DeltaNet with a sparse Mixture-of-Experts architecture to power its multimodal agent capabilities. Similarly, Kimi Linear from Moonshot AI interleaves its own gated recurrent component, Kimi Delta Attention, with full attention layers in a 3:1 ratio, reporting up to a 6x increase in decoding throughput and up to a 75% reduction in KV cache usage for million-token contexts.

For startups, this architectural evolution is an opportunity to reduce operational costs. Because hybrid models need less compute and memory, they can lower cloud spending on both training and inference. The recurrent layers keep a fixed-size state instead of a cache that grows with context length, so more capable models can run on less expensive hardware, widening access to near state-of-the-art AI capabilities. Early-stage companies could build more sophisticated AI-powered features into their products without needing massive capital for GPU clusters.

The move toward hybrid systems also has implications for the engineering talent market. Fundamental skills in machine learning and deep learning remain crucial, but expertise in RNNs, state-space models, and custom kernel development is becoming increasingly valuable. As open-source inference engines like vLLM add support for these architectures, engineers who can navigate and optimize these more complex models will be in high demand. Recent updates to vLLM show that support for hybrid models like Olmo Hybrid is actively being integrated, closing the tooling gap faster than many anticipated.
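
To make the 3:1 layer pattern and its effect on KV cache size concrete, here is a minimal Python sketch. It is an illustration only, not code from Olmo Hybrid, Qwen 3.5, or Kimi Linear. The function names and the model dimensions (32 layers, 8 KV heads, head size 128, 2-byte values) are hypothetical, and the small fixed-size state held by the recurrent layers is ignored.

```python
def layer_schedule(num_layers: int, recurrent_per_attention: int = 3) -> list[str]:
    """Repeat N recurrent layers followed by one full-attention layer."""
    period = recurrent_per_attention + 1
    return [
        "attention" if (i + 1) % period == 0 else "recurrent"
        for i in range(num_layers)
    ]


def kv_cache_bytes(schedule: list[str], context_len: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size: keys plus values (factor of 2) for each attention layer.

    Recurrent layers are assumed to store no per-token cache; their
    fixed-size state does not grow with context and is left out here.
    """
    attention_layers = sum(1 for kind in schedule if kind == "attention")
    return (2 * attention_layers * context_len
            * num_kv_heads * head_dim * bytes_per_value)


# Hypothetical dimensions, chosen only to illustrate the arithmetic.
NUM_LAYERS, KV_HEADS, HEAD_DIM, CONTEXT = 32, 8, 128, 1_000_000

pure = ["attention"] * NUM_LAYERS      # every layer keeps a KV cache
hybrid = layer_schedule(NUM_LAYERS)    # 3 recurrent : 1 attention

pure_bytes = kv_cache_bytes(pure, CONTEXT, KV_HEADS, HEAD_DIM)
hybrid_bytes = kv_cache_bytes(hybrid, CONTEXT, KV_HEADS, HEAD_DIM)
reduction = 1 - hybrid_bytes / pure_bytes

print(hybrid[:8])
print(f"pure: {pure_bytes / 1e9:.1f} GB, hybrid: {hybrid_bytes / 1e9:.1f} GB")
print(f"KV cache reduction: {reduction:.0%}")
```

Under these assumptions, only one layer in four keeps a per-token cache, so the cache shrinks by three quarters whatever the other dimensions are. That matches the 75% figure reported for the 3:1 design, though the real models may account for memory differently.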
