Local models vs API throughput
A benchmark shared this week showed local deployments of “smart” models running at about 5 to 10 transactions per second (tps), while an API route measured around 150 tps in a comparable test. (x.com) That raw throughput gap is one reason teams still route high-volume inference to cloud APIs when latency and scale matter. (x.com)
Running a large language model on your own server is often much slower than calling a managed cloud API, even when both are answering the same kind of prompt. (x.com) A benchmark shared this week put local deployments of “smart” models at about 5 to 10 transactions per second, while an API route in a comparable test reached roughly 150 transactions per second. The post was published on X and framed the gap as one reason teams keep high-volume workloads in the cloud. (x.com)

In plain terms, throughput is how many requests a system can finish in a second. For language models, that number depends on the model size, the prompt length, the answer length, the number of users hitting the system at once, and how efficiently the serving software batches requests together on the graphics processor. (docs.vllm.ai)

Open-source serving stacks such as vLLM are built to raise that number. The project describes itself as a library for inference and serving and says its speed comes from techniques including continuous batching, prefix caching, quantization, and optimized attention kernels. (docs.vllm.ai)

Even then, benchmarking is tricky. vLLM’s own documentation says its built-in benchmarks are mainly for feature evaluation and regression testing, and it recommends a separate framework called GuideLLM for production server benchmarks. (docs.vllm.ai)

That is why side-by-side numbers can vary so widely. Anyscale’s LLMPerf project was built to make provider comparisons more reproducible, and one reference workload it cites uses 550 input tokens and 150 output tokens so different services can be measured on the same shape of request. (anyscale.com)

Cloud providers have another advantage: they tune the whole stack around shared traffic.
Together AI said in January 2026 that teams chasing low latency and low cost usually get gains from better scheduling, larger effective batch sizes, and lighter-weight formats such as FP8 or FP4 quantization, which it said can improve throughput by 20% to 40% in production deployments. (together.ai)

Managed APIs also expose rate limits in ways that encourage high-volume use. OpenAI’s API documentation says limits are measured in requests per minute and tokens per minute, and it advises developers to batch tasks into each request when request-per-minute limits are the bottleneck. (developers.openai.com)

That does not mean local inference is obsolete. Teams still run models on their own hardware for data control, predictable costs at steady utilization, offline use, and cases where an open-weight model is “good enough” without paying per call. (docs.vllm.ai; developers.openai.com)

The newer split is less “local versus API” than “which jobs need which lane.” If a product needs hundreds of fast, concurrent responses, the benchmark gap in this week’s test helps explain why cloud APIs still win that traffic. (x.com)
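
The rate-limit advice above comes down to simple arithmetic, and a rough sketch may make it concrete. The Python below is illustrative only: the `RateLimit` class, the `max_tasks_per_minute` function, and the limit values (500 requests per minute, 1,000,000 tokens per minute) are hypothetical and do not come from any provider's documentation. It also assumes that packing several tasks into one request leaves the token cost per task unchanged. The 700 tokens per task mirror the 550-input, 150-output reference workload mentioned earlier.

```python
from dataclasses import dataclass


@dataclass
class RateLimit:
    """Hypothetical per-minute limits; real values vary by account and model."""
    requests_per_minute: int
    tokens_per_minute: int


def max_tasks_per_minute(limit: RateLimit, tokens_per_task: int,
                         tasks_per_request: int) -> int:
    """Upper bound on tasks finished per minute under both limits."""
    # Request ceiling: each request can carry several tasks.
    by_requests = limit.requests_per_minute * tasks_per_request
    # Token ceiling: assumed unaffected by how tasks are grouped.
    by_tokens = limit.tokens_per_minute // tokens_per_task
    # Whichever limit is reached first caps the workload.
    return min(by_requests, by_tokens)


if __name__ == "__main__":
    # Illustrative limits only, not taken from any provider.
    limit = RateLimit(requests_per_minute=500, tokens_per_minute=1_000_000)
    tokens_per_task = 550 + 150  # reference workload shape: input + output

    print(max_tasks_per_minute(limit, tokens_per_task, tasks_per_request=1))  # 500
    print(max_tasks_per_minute(limit, tokens_per_task, tasks_per_request=4))  # 1428
```

Under these assumed numbers, sending one task per request stalls at the request limit (500 tasks per minute), while grouping four tasks per request moves the bottleneck to the token limit (1,428 tasks per minute). Past that point further batching buys nothing, which is why the request limit and the token limit have to be considered together.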