llama.cpp Shows Strong Performance on MoE Models
Users are reporting impressive local inference speeds with llama.cpp, achieving 25-80 tokens/second on Mixture of Experts (MoE) models like Qwen3.5-35B on consumer RTX 40-series GPUs. These results highlight llama.cpp as a popular path for hands-on deployment experiments, often chosen to bypass the limitations of higher-level tools like Ollama.
Mixture of Experts (MoE) architectures gain efficiency by activating only a fraction of a model's total parameters for any given input. A "gating network," or router, dynamically selects which specialized sub-networks, or "experts," process each token, keeping compute cost down while allowing a very large total parameter count. So although a model like Qwen3.5-35B has a large number of parameters, only a subset is used for each token during inference, which is key to its speed.
llama.cpp's performance advantage often comes from its close-to-the-metal design, which minimizes overhead compared with more user-friendly wrappers like Ollama. Although Ollama uses llama.cpp as its core engine, the layers it adds for accessibility and ease of use can introduce a performance penalty. Benchmarks have shown llama.cpp to be anywhere from 13% to 80% faster than Ollama on the same hardware.
A critical technology for running these large models on consumer hardware is quantization, which reduces the numerical precision of model weights to save memory. File formats like GGUF are designed to store quantized weights, enabling large models to fit into the VRAM of cards like the RTX 4090. Advanced quantization schemes can even be tailored to MoE models, applying different levels of compression to the expert layers than to the parts of the model that run for every token.
Deploying MoE models, even when quantized, presents a particular challenge: the entire model, including all experts, must be loaded into memory (VRAM + RAM), even though only a fraction is used for each token. llama.cpp provides granular control over this process, allowing users to offload selected layers to the CPU, particularly the less frequently used expert layers, so that the rest fits within the available VRAM. This level of control is a key reason why it is a preferred tool for hands-on experimentation.
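
To make the memory constraint concrete, here is a rough back-of-the-envelope sketch in Python. It is not llama.cpp code: the helper names (MoEModel, weight_bytes, plan_offload), the split between expert and shared parameters, and the 4.5 bits-per-weight figure are all illustrative assumptions, and the estimate covers weights only, ignoring the KV cache, activations, and runtime overhead.

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # bytes per GiB


@dataclass
class MoEModel:
    """Hypothetical description of a quantized MoE model (illustrative only)."""
    total_params: float     # every parameter, including all experts
    expert_params: float    # parameters that live inside expert layers
    bits_per_weight: float  # assumed average bits per weight after quantization


def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Bytes needed to store `params` weights at the given average precision."""
    return params * bits_per_weight / 8


def plan_offload(model: MoEModel, vram_budget_gib: float,
                 expert_fraction_on_cpu: float) -> dict:
    """Estimate the GPU/CPU weight split when a fraction of the expert
    weights is kept in system RAM instead of VRAM.

    Weights only: KV cache, activations, and runtime overhead are ignored.
    """
    if not 0.0 <= expert_fraction_on_cpu <= 1.0:
        raise ValueError("expert_fraction_on_cpu must be between 0 and 1")

    expert_bytes = weight_bytes(model.expert_params, model.bits_per_weight)
    shared_bytes = weight_bytes(model.total_params - model.expert_params,
                                model.bits_per_weight)

    cpu_bytes = expert_bytes * expert_fraction_on_cpu
    gpu_bytes = shared_bytes + (expert_bytes - cpu_bytes)

    return {
        "gpu_gib": gpu_bytes / GIB,
        "cpu_gib": cpu_bytes / GIB,
        "fits": gpu_bytes / GIB <= vram_budget_gib,
    }


if __name__ == "__main__":
    # Made-up numbers: 35B total parameters, 32B of them in experts,
    # quantized to an assumed average of 4.5 bits per weight.
    model = MoEModel(total_params=35e9, expert_params=32e9, bits_per_weight=4.5)
    for fraction in (0.0, 0.5, 1.0):
        plan = plan_offload(model, vram_budget_gib=16,
                            expert_fraction_on_cpu=fraction)
        print(f"experts on CPU: {fraction:.0%} -> {plan}")
```

The point of the sketch is that the total footprint (GPU plus CPU) stays the same whatever the split, because every expert has to be resident somewhere. Offloading only changes how much of that footprint has to fit in VRAM.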