New Sparse LLM Claims to Beat Mixtral, Gemma

A new open-source model family, TurboSparse-LLM, claims to outperform Mixtral and Gemma on speed and cost by exploiting extreme activation sparsity. Because most neurons in its feed-forward layers output zero for any given input, a sparsity-aware inference engine can skip the corresponding computation, which raises throughput and lowers memory usage during inference.

The core innovation behind TurboSparse-LLM is a new activation function called dReLU, which enables extreme levels of activation sparsity. Unlike common activation functions such as SwiGLU and GeGLU, dReLU is designed to make the model's neurons output zero for a majority of inputs, significantly reducing the number of calculations needed during inference. According to the researchers, this approach yields a 2-5x speedup in decoding while maintaining or even improving benchmark performance compared to the original Mistral and Mixtral models.

For the TurboSparse-Mixtral-47B model, this technique pushes the sparsity of the Mixture-of-Experts (MoE) layers from 75% to 97%. As a result, only about 4.3 billion of the model's 47 billion parameters, roughly 9%, are actually used during an inference pass. The TurboSparse-Mistral-7B model sees its feed-forward network (FFN) sparsity increase to 90%.

To realize these performance gains, the researchers developed a specialized inference engine called PowerInfer. PowerInfer is designed to capitalize on the activation locality of sparse models by dynamically managing which neurons are loaded onto the GPU ("hot" neurons) and which are handled by the CPU ("cold" neurons). On mobile devices, the combination of TurboSparse-Mixtral-47B and the PowerInfer-2 engine has demonstrated speeds of up to 11 tokens per second.

For ML engineers working with popular serving frameworks, it is important to note that optimized inference for TurboSparse models is not yet available out of the box in tools like vLLM or TensorRT-LLM. The Hugging Face model card for TurboSparse-Mixtral indicates that the specialized acceleration code is still being refined and that, for now, the model can be run like a standard dense model. In practice, exploiting its sparsity in a production environment would currently require custom kernel development and integration.

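
To make the dReLU idea described above more concrete, here is a minimal pure-Python sketch of a gated feed-forward layer. It assumes that dReLU applies a ReLU to both the gate and the up projection before multiplying them, which is how the technique is generally described; the article itself does not spell out the formula. The names `drelu_ffn`, `w_gate`, `w_up`, and `w_down_rows` are hypothetical and chosen only for illustration, and this is not the authors' implementation.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def matvec(rows, x):
    """Multiply a matrix, given as a list of rows, by the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def drelu_ffn(x, w_gate, w_up, w_down_rows):
    """Gated FFN with ReLU on both branches (assumed dReLU form).

    w_gate, w_up: one row of input weights per hidden neuron.
    w_down_rows:  one row of output weights per hidden neuron.
    Returns (output_vector, number_of_active_neurons).
    """
    gate = relu(matvec(w_gate, x))
    up = relu(matvec(w_up, x))
    # A neuron is active only if BOTH branches are positive.
    hidden = [g * u for g, u in zip(gate, up)]

    out = [0.0] * len(w_down_rows[0])
    active = 0
    for h, row in zip(hidden, w_down_rows):
        if h == 0.0:
            continue  # inactive neuron: its output weights are never touched
        active += 1
        for j, w in enumerate(row):
            out[j] += h * w
    return out, active


# Tiny hand-checkable example: 2 inputs, 3 hidden neurons, 2 outputs.
x = [1.0, -2.0]
w_gate = [[1, 0], [0, 1], [1, 1]]
w_up = [[1, 0], [1, 0], [0, -1]]
w_down_rows = [[2, 3], [5, 7], [11, 13]]

out, active = drelu_ffn(x, w_gate, w_up, w_down_rows)
print(out, active)  # prints [2.0, 3.0] 1
```

Only one of the three hidden neurons fires here, so two rows of the down projection are skipped entirely. Note that this sketch still computes the gate and up projections densely; a sparsity-aware engine would also need a way to avoid that work for inactive neurons, which is not attempted here.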
While vLLM and TensorRT-LLM have growing support for various forms of sparsity, this is an evolving area. vLLM has recently added support for sparse embedding models like SPLADE and has developed fused kernels for MoE models, which also manage a form of sparsity. TensorRT-LLM, through its Model Optimizer, supports techniques like 2:4 structured sparsity. Those mechanisms rely on expert routing or on weight patterns that are fixed ahead of time, whereas the set of active neurons in TurboSparse-LLM changes with every input, so integrating its activation sparsity pattern would be a non-trivial engineering task.

The researchers behind TurboSparse-LLM are from Shanghai Jiao Tong University and Tsinghua University. Their work highlights a growing trend in LLM research that focuses on optimizing inference efficiency through architectural innovations rather than solely scaling up model size. This approach could lead to significant cost savings in GPU infrastructure for enterprise applications.

The TurboSparse models were trained on a dataset of 150 billion tokens and, according to their model card, may still have performance limitations in certain tasks and in non-English languages due to the composition of the training data. However, the models are designed to be fine-tuned using any standard framework, allowing for adaptation to specific enterprise use cases.
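
The difference between 2:4 structured sparsity and the activation sparsity mentioned above can be illustrated with a short sketch. In a 2:4 pattern, at most two of every four consecutive weights are nonzero, and the pattern is fixed once the weights are pruned. The magnitude-based rule below is a common illustrative choice and is an assumption here, not a description of how TensorRT-LLM's Model Optimizer selects weights. The function names `prune_2_of_4` and `active_neurons` are hypothetical.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude values in each group of four weights."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Keep the two largest magnitudes; ties go to the earlier index
        # because sorted() is stable.
        keep = set(sorted(range(4), key=lambda i: -abs(group[i]))[:2])
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def active_neurons(pre_activations):
    """Indices whose pre-activation is positive, i.e. that survive a ReLU."""
    return [i for i, v in enumerate(pre_activations) if v > 0.0]


# Weight sparsity: decided once, identical for every input.
weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.1]
print(prune_2_of_4(weights))
# prints [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]

# Activation sparsity: the set of live neurons depends on the input.
print(active_neurons([0.5, -1.0, 2.0, -0.3]))   # prints [0, 2]
print(active_neurons([-0.5, 1.0, -2.0, 0.3]))   # prints [1, 3]
```

Because the zero positions in the pruned weights never move, a kernel can be specialized for them in advance. The active set in the second half changes per input, which is why a serving framework needs dedicated runtime support to benefit from it.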


Shared from Scout - Be the smartest in the room.