AI2 Releases Efficient 'OLMo Hybrid' Model

The Allen Institute for AI (AI2) has released OLMo Hybrid, a 7B-parameter open-source model that combines transformer and RNN layers. The efficiency-focused architecture reportedly needs about half as many training tokens to match its predecessor and improves long-context inference throughput by 75%.

OLMo Hybrid's architecture replaces 75% of the standard transformer attention layers with a linear recurrent neural network (RNN) mechanism known as Gated DeltaNet. It does this through a 3:1 pattern: three Gated DeltaNet layers are followed by one traditional multi-head attention layer, and this block is repeated throughout the model. The design aims to combine the state-tracking strengths of RNNs with the precise recall abilities of transformers.

The hybrid structure provides a significant boost in training efficiency. In a controlled comparison, OLMo Hybrid achieved the same accuracy on the MMLU benchmark as its predecessor, OLMo 3, with 49% fewer training tokens. Needing only 51% of the tokens amounts to a near-doubling of data efficiency (1 / 0.51 is about 1.96), meaning a model of the same capability can be trained with roughly half the data and compute resources.

For tasks involving long contexts, the new architecture also shows substantial gains. After long-context extension, OLMo Hybrid demonstrates a 75% improvement in inference throughput compared to a pure transformer architecture. On the RULER long-context benchmark at 64k context length, the hybrid model scores significantly higher than its predecessor, OLMo 3 7B.

The development of OLMo (Open Language Model) is a core project of AI2, co-led by Hanna Hajishirzi, a Senior Director at AI2 and a professor at the University of Washington. The project's guiding principle is to be "truly open," providing full access to training data, code, and model checkpoints to advance the scientific understanding of language models.

While OLMo Hybrid shows clear gains over its pure transformer counterpart within the OLMo family, it enters a competitive field of 7B-class models. For context, Mistral 7B has been noted for strong performance in commonsense reasoning, while previous versions of OLMo have shown an edge in knowledge-intensive tasks. Llama 3 8B is also a strong performer in this category, often cited for its balance of performance and efficiency.

This release is part of a broader trend of architectures that mix attention with RNN-like mechanisms. Other models in this space include Qwen 3.5, Kimi Linear, and Nvidia's Nemotron-H. The results for OLMo Hybrid suggest that these hybrid designs are not only useful for inference efficiency but can also lead to more expressive and scalable models during pre-training.
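
To make the 3:1 layer pattern and the token-efficiency arithmetic described above concrete, here is a minimal Python sketch. It is an illustration only, not AI2's implementation: the function names, the layer labels, and the 32-layer depth are assumptions chosen for the example, and only the 3:1 ratio and the 49% figure come from the article.

```python
from collections import Counter


def hybrid_layer_pattern(num_layers, recurrent_per_attention=3):
    """Label each layer so that every block of recurrent layers
    is followed by one attention layer (3:1 by default)."""
    block = recurrent_per_attention + 1
    return [
        "attention" if (i + 1) % block == 0 else "gated_deltanet"
        for i in range(num_layers)
    ]


def data_efficiency_gain(fraction_fewer_tokens):
    """Convert 'X fewer tokens for the same accuracy' into a multiplier."""
    return 1.0 / (1.0 - fraction_fewer_tokens)


# Hypothetical depth of 32 layers; the article does not state the real depth.
layers = hybrid_layer_pattern(32)
counts = Counter(layers)

# Share of layers that are recurrent rather than attention.
recurrent_share = counts["gated_deltanet"] / len(layers)

print(layers[:8])
print(f"recurrent share: {recurrent_share:.2f}")
print(f"data efficiency gain: {data_efficiency_gain(0.49):.2f}x")
```

With any depth that is a multiple of four, three out of every four layers are recurrent, which is where the 75% replacement figure comes from. The second helper shows why 49% fewer tokens is described as a near-doubling rather than an exact doubling.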
