Serving trade-offs: efficiency vs portability
A recent serving‑stack comparison argues TensorRT‑LLM still extracts maximum NVIDIA performance but requires more setup and ties teams to one vendor, while alternatives trade peak efficiency for portability. The write‑up frames the core decision as flexibility versus absolute throughput and suggests teams choose based on workload heterogeneity. (theaiengineer.substack.com)
A serving stack is the software layer that turns a language model into an application programming interface, batching requests, caching past tokens, and deciding how hard a graphics processor works. In 2026, the split is less about whether teams can serve models at all and more about whether they want maximum NVIDIA speed or broader portability. (docs.nvidia.com)

NVIDIA says TensorRT-LLM builds TensorRT engines with “state-of-the-art optimizations” for inference on NVIDIA graphics processors, and its developer page says the system is “specifically customized for NVIDIA platforms.” That design is the reason recent comparisons keep placing it at the top of the throughput table on NVIDIA hardware. (docs.nvidia.com, developer.nvidia.com)

vLLM describes itself as a “high-throughput and memory-efficient” serving engine and says it offers a drop-in OpenAI-compatible application programming interface plus support across NVIDIA CUDA, AMD ROCm, Google Tensor Processing Units, Amazon Web Services Neuron, Intel Gaudi, Apple Silicon, and more. That makes it the default middle ground for teams that expect model changes or mixed hardware. (vllm.ai)

SGLang sits closer to vLLM than to Ollama in production use, but it pushes a different idea: make repeated prompts cheaper by reusing shared prefixes in memory. Its documentation says it is built for “low-latency and high-throughput inference” from one graphics processor to large clusters, and the project’s RadixAttention design focuses on reusing cached tokens across related requests. (docs.sglang.io, lmsys.org)

Ollama targets a different job. Its documentation calls it “the easiest way to get up and running” with large language models, with quick local setup on macOS, Windows, Linux, and Docker, which is why it shows up in comparisons even when it is not chasing data-center throughput. (docs.ollama.com, ollama.com)

Outside benchmarks in March 2026 show the trade-off in numbers.
Spheron reported 2,100 tokens per second for TensorRT-LLM, 1,920 for SGLang, and 1,850 for vLLM on one H100 80-gigabyte graphics processor running Llama 3.3 70B at FP8, while TensorRT-LLM’s cold start was about 28 minutes versus roughly 58 to 62 seconds for SGLang and vLLM. (spheron.network)

That setup cost is part of the divide. NVIDIA’s own documentation centers on building engines, quantization choices such as FP8 and NVFP4, and deployment paths like `trtllm-serve`, while vLLM’s homepage leads with a single install command and an OpenAI-compatible server. (nvidia.github.io, vllm.ai)

The hardware question is just as concrete. TensorRT-LLM is built for NVIDIA graphics processors, while vLLM and SGLang both advertise support beyond one vendor, and Ollama is designed to run models on local machines across the main desktop operating systems. (developer.nvidia.com, vllm.ai, docs.sglang.io, docs.ollama.com)

That leaves buyers with a narrower decision than the marketing suggests. Teams serving one stable model on fleets of NVIDIA H100s can justify an engine-building workflow for extra throughput, while teams rotating models, mixing accelerators, or shipping local prototypes usually accept lower peak efficiency in exchange for faster setup and fewer hardware constraints. (developer.nvidia.com, vllm.ai, docs.ollama.com, docs.sglang.io)

The comparison is no longer about finding one winner for every deployment. It is about where a team wants to pay: in engineering time up front for NVIDIA-tuned speed, or in ongoing compute overhead for a stack that can move with the workload. (docs.nvidia.com, vllm.ai, docs.sglang.io, docs.ollama.com)
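
The "where to pay" question can be made concrete with a back-of-envelope calculation. The sketch below uses only the Spheron figures quoted above (2,100 versus 1,850 tokens per second, and a cold start of about 28 minutes versus roughly one minute) and asks how long a server must stay up before the faster engine has produced as many tokens as the quicker-starting one. The function name `break_even_seconds` is hypothetical, the 60-second vLLM cold start is a midpoint chosen for illustration, and the model assumes both engines run saturated at a constant rate and pay the full cold start on every launch. None of those assumptions come from the benchmark itself.

```python
# Back-of-envelope break-even between a slow-starting, higher-throughput
# engine and a fast-starting, lower-throughput one.
# Throughput and cold-start figures are the Spheron numbers quoted above;
# the constant-rate, fully saturated model is an assumption of this sketch.

def break_even_seconds(fast_tps, fast_cold_start_s, slow_tps, slow_cold_start_s):
    """Return the wall-clock seconds after launch at which the
    higher-throughput engine has generated as many tokens as the
    quicker-starting one."""
    if fast_tps <= slow_tps:
        raise ValueError("fast_tps must exceed slow_tps")
    # Solve fast_tps * (t - fast_cold) = slow_tps * (t - slow_cold) for t.
    numerator = fast_tps * fast_cold_start_s - slow_tps * slow_cold_start_s
    return numerator / (fast_tps - slow_tps)


TRT_TPS, TRT_COLD_S = 2100, 28 * 60   # TensorRT-LLM: about 28 minutes
VLLM_TPS, VLLM_COLD_S = 1850, 60      # vLLM: midpoint of 58 to 62 seconds (assumed)

t = break_even_seconds(TRT_TPS, TRT_COLD_S, VLLM_TPS, VLLM_COLD_S)
print(f"break-even after {t / 3600:.1f} hours of wall-clock time")
```

Under those assumptions the crossover lands at about 3.8 hours after launch. A deployment that restarts more often than that would not recoup the longer cold start from throughput alone, while a long-lived server on a stable model would. If the cold-start cost is not paid on every launch, or if traffic does not keep the graphics processor saturated, the result shifts accordingly.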