New Benchmarks Rank LLMs on Consumer GPUs

A comprehensive evaluation benchmarked nine leading LLMs on consumer-grade GPUs with 16GB of VRAM using Ollama. The results identify Llama 3 8B, Mistral 7B, and Phi-3 Mini as top performers for workloads that prioritize privacy, efficiency, and zero API costs. The study shows that INT4 and Q8 quantization make it feasible to run competitive models on modest hardware.
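As a rough illustration of why quantization matters on a 16GB card, the sketch below estimates the memory needed for model weights alone as parameters times bits per weight. The function names and the 20% overhead allowance (for KV cache, activations, and runtime buffers) are assumptions made for this example, not figures from the study. Real footprints vary with the quantization format and context length.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits_in_vram(
    params_billions: float,
    bits_per_weight: float,
    vram_gb: float = 16.0,
    overhead_fraction: float = 0.2,  # assumed allowance, not a measured value
) -> bool:
    """Return True if weights plus the assumed overhead fit in the given VRAM."""
    needed_gb = weight_memory_gb(params_billions, bits_per_weight) * (1 + overhead_fraction)
    return needed_gb <= vram_gb


# Parameter counts as reported for each model; bit widths are the common cases.
for name, size_b in [("Phi-3 Mini", 3.8), ("Mistral 7B", 7.3), ("Llama 3 8B", 8.0)]:
    for bits in (16, 8, 4):
        gb = weight_memory_gb(size_b, bits)
        verdict = "fits" if fits_in_vram(size_b, bits) else "does not fit"
        print(f"{name} at {bits}-bit: ~{gb:.1f} GB weights, {verdict} in 16 GB")
```

Under these assumptions, an 8-billion-parameter model at 16-bit precision needs about 16 GB for weights alone and exceeds the card once overhead is added, while the same model at 8-bit or 4-bit fits comfortably.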

- Microsoft's Phi-3-Mini, at 3.8 billion parameters, is designed to rival the performance of much larger models like Mixtral 8x7B and GPT-3.5. It can be deployed on a device like an iPhone 14, where with 4-bit quantization it occupies just 1.8GB of memory and can generate over 12 tokens per second.
- Meta's Llama 3 8B, an 8-billion parameter model, was trained on a dataset seven times larger than its predecessor, Llama 2, with a significant focus on code. On consumer GPUs like the RTX 3090, it can achieve a cost of just $0.228 per million output tokens when running with Ollama.
- Mistral 7B, a 7.3-billion parameter model, utilizes Grouped-Query Attention (GQA) and Sliding Window Attention (SWA) to enable faster inference and handle longer sequences more efficiently. It outperforms the larger Llama 2 13B on all benchmarks and shows strong performance in code and reasoning tasks.
- Quantization from 16-bit (BF16) to 4-bit (INT4) can increase throughput by over 2.5 times by reducing memory bandwidth limitations, with minimal impact on model accuracy for many tasks. For instance, a 32-billion parameter model can see its memory footprint shrink from 61GB to 18GB.
- For production environments with concurrent users, frameworks like vLLM significantly outperform Ollama, with benchmarks showing vLLM achieving up to 793 tokens per second compared to Ollama's 41. Ollama is generally better suited for local development and prototyping.
- The decision to self-host versus using a cloud API involves a significant cost-benefit analysis. Self-hosting can become more cost-effective at high volumes, potentially breaking even with cloud API costs within months at a usage of 30 million tokens per day. However, this requires factoring in often-underestimated personnel costs for DevOps and infrastructure management.
- Enterprise-grade GPUs like the NVIDIA H100 and A100 are optimized for large-scale training and multi-GPU performance, while high-end consumer GPUs such as the RTX 4090 are highly capable for inference and fine-tuning of quantized models up to around 13 billion parameters.
- Specialized inference servers like NVIDIA's TensorRT-LLM can provide peak performance on NVIDIA hardware by using optimizations like kernel tuning and in-flight batching, potentially offering up to 8 times faster inference compared to CPU-only platforms. However, vLLM often provides broader model support and an easier integration path with Hugging Face models.
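The self-hosting trade-off described above comes down to simple arithmetic, sketched below. Only the 30 million tokens per day volume comes from the brief. The hardware price, API price, and daily operating cost are hypothetical placeholders chosen to make the example readable, and the function names are invented for this sketch. Personnel time, which the brief flags as often underestimated, would be folded into the daily operating cost.

```python
from typing import Optional


def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Cost of generating one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


def break_even_days(
    hardware_cost_usd: float,
    daily_tokens: float,
    api_price_per_million_usd: float,
    self_host_daily_opex_usd: float,
) -> Optional[float]:
    """Days until upfront hardware cost is recovered, or None if it never is."""
    daily_api_cost = daily_tokens / 1_000_000 * api_price_per_million_usd
    daily_savings = daily_api_cost - self_host_daily_opex_usd
    if daily_savings <= 0:
        return None  # self-hosting costs more per day than the API
    return hardware_cost_usd / daily_savings


# Hypothetical inputs: $3,000 of hardware, $2.00 per million API tokens,
# and $10 per day for power, hosting, and staff time.
days = break_even_days(
    hardware_cost_usd=3000.0,
    daily_tokens=30_000_000,  # volume cited in the brief
    api_price_per_million_usd=2.0,
    self_host_daily_opex_usd=10.0,
)
print(f"Break-even after about {days:.0f} days")
```

With these placeholder numbers the hardware pays for itself in about two months. Raising the daily operating cost or lowering the API price quickly stretches that horizon, and the function returns None once the API is cheaper per day than running the hardware.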
