AI2 Releases Hybrid Transformer Model
The Allen Institute for AI (AI2) has released OLMo Hybrid, a 7-billion-parameter open-source model that combines transformer attention with recurrent neural network (RNN) layers. It matches its predecessor's MMLU accuracy with roughly half the training tokens and improves inference throughput on long-context tasks, which may appeal to data scientists working with long inputs.
OLMo Hybrid's architecture replaces 75% of the standard transformer attention layers with a modern recurrent neural network (RNN) design called Gated DeltaNet. The model alternates between the two structures in a repeating pattern of three DeltaNet layers followed by one multi-head attention layer. The design aims to combine the state-tracking strengths of RNNs with the precise recall of transformers.
The hybrid structure yields measurable efficiency gains. On the MMLU benchmark for general knowledge, OLMo Hybrid reaches the same accuracy as its predecessor, OLMo 3, while using 49% fewer training tokens. That works out to a roughly 2x improvement in data and compute efficiency during pre-training.
The model was pretrained on 6 trillion tokens. Training was among the first runs to use NVIDIA's HGX B200 GPUs, migrating from H100s midway through. Pre-training of the 7-billion-parameter model finished in just over six days on a cluster of 512 GPUs.
Beyond training efficiency, the hybrid design excels at long-context tasks. At an input length of 64,000 tokens, OLMo Hybrid with the DRoPE method scores 85.0 on the RULER long-context benchmark, up from the 70.9 scored by the pure-transformer OLMo 3.
AI2 has released the model family under the permissive Apache 2.0 license. In keeping with its open-science approach, the release includes not only the model weights but also all intermediate checkpoints, the training code, and a detailed technical report, all available on Hugging Face.
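
The layer pattern and the token-efficiency figure described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not AI2's code: the function names and the 32-layer depth are hypothetical, chosen only for illustration, and the article does not state the model's actual layer count.

```python
def hybrid_layer_pattern(num_layers, deltanet_per_block=3):
    """Repeat a block of `deltanet_per_block` DeltaNet layers plus one attention layer."""
    block = ["deltanet"] * deltanet_per_block + ["attention"]
    return [block[i % len(block)] for i in range(num_layers)]


def token_efficiency_gain(fraction_fewer_tokens):
    """Efficiency multiplier implied by reaching the same accuracy with fewer tokens."""
    return 1.0 / (1.0 - fraction_fewer_tokens)


# Assumption: 32 layers is a hypothetical depth, not a figure from the article.
layers = hybrid_layer_pattern(32)
recurrent_share = layers.count("deltanet") / len(layers)

print(layers[:4])                              # ['deltanet', 'deltanet', 'deltanet', 'attention']
print(f"{recurrent_share:.0%}")                # 75%
print(f"{token_efficiency_gain(0.49):.2f}x")   # 1.96x
```

With three recurrent layers in every block of four, the recurrent share is 75% whenever the depth is a multiple of four, and 49% fewer tokens corresponds to 1 / 0.51, or about 1.96x, which is where the "roughly 2x" figure comes from.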