OpenTelemetry monitors vLLM metrics
- Dash0 published a guide on May 5 showing how to observe vLLM with OpenTelemetry, combining traces and Prometheus metrics for production inference debugging.
- vLLM exposes TTFT, end-to-end latency, prefill and decode timings, request states, KV-cache usage and preemption counters through its metrics endpoint.
- vLLM docs include an OpenTelemetry proof of concept and production metrics pages operators can use for implementation details.


Dash0 published a community guide on May 5 showing how to monitor vLLM with OpenTelemetry, adding traces for inference requests and metrics for cache pressure, queue depth and latency phases. vLLM already ships with OpenTelemetry support for traces and a Prometheus-compatible `/metrics` endpoint for serving metrics, according to Dash0 and the project’s own documentation. (dash0.com)

OpenTelemetry is a vendor-neutral framework for collecting traces, metrics and logs, and its generative AI work is aimed at standardizing telemetry around model interactions, token usage and response metadata. In the vLLM case, that means operators can combine request traces with engine-level counters and histograms instead of treating model serving like a generic HTTP service. (opentelemetry.io)

### Which vLLM signals are the ones practitioners are actually being told to watch?

vLLM’s metrics documentation lists time-to-first-token, inter-token latency, end-to-end request latency, request prefill time and request decode time among the request-level histograms exposed through `/metrics`. The same docs list server-level gauges and counters including requests running or waiting, KV-cache usage percentage, prefix-cache queries and hits, and cumulative preemptions. (docs.vllm.ai)

Dash0’s guide names cache utilization, time to first token, preemption rate and queue depth as the inference-specific signals needed for capacity planning and latency debugging. The post says standard application performance monitoring can show a slow request, but not whether the cause was KV-cache preemption, scheduler queue pressure, a long prefill phase or a decode bottleneck.

### Why does TTFT keep showing up ahead of ordinary latency charts?

vLLM’s design docs separate request-level service indicators from server-level state, and they identify TTFT as a first-class histogram alongside end-to-end latency and decode timing.
That distinction matters because a model can have acceptable overall completion time while still feeling slow to users if the first token arrives late.

Dash0’s post says TTFT and time per output token can diverge significantly under batched workloads. In practice, that gives operators a way to tell whether the problem is startup delay before generation begins or slower token emission once decoding is underway. (dash0.com)

### What does OpenTelemetry add beyond the metrics endpoint?

vLLM’s OpenTelemetry proof-of-concept shows one trace per request when the server is started with an OTLP traces endpoint, and the example uses Jaeger to visualize spans from both a client and the vLLM server. The docs also show FastAPI instrumentation so application spans can appear in the same trace as model-serving spans. (docs.vllm.ai)

Dash0’s guide extends that pattern into a RAG example with a FastAPI app, vLLM server and OTel Collector in one Docker Compose stack. That setup lets an operator follow a request across retrieval, application logic and inference instead of inspecting model latency in isolation. (dash0.com)

### Where do cache pressure and scheduler problems show up first?

vLLM’s production metrics include `kv_cache_usage_perc`, prefix-cache query and hit counters, request-state gauges and a cumulative preemption counter. Those are the measurements that show whether the engine is running out of room, reusing cached prefixes effectively or interrupting work under load. (docs.vllm.ai)

Dash0 says KV-cache pressure can degrade throughput without surfacing as explicit errors, and that queue depth can show capacity strain before users notice it. The guide frames traces and metrics as complementary: traces explain an individual slow request, while metrics show whether the slowdown reflects broader scheduler contention. (dash0.com)

### How mature is the OpenTelemetry side of this stack?

vLLM documentation includes an OpenTelemetry setup example using OTLP exporters and Jaeger, and Dash0 says vLLM ships with built-in instrumentation for traces when the optional dependencies are installed. Dash0 also notes that vLLM pins its own span attribute names rather than following the evolving OpenTelemetry GenAI semantic conventions directly. (docs.vllm.ai) (dash0.com)

The OpenTelemetry project said in a December 2024 blog post that GenAI semantic conventions and instrumentation libraries were still being developed to standardize traces, metrics and events across AI systems. That leaves current vLLM observability workable in production, but still part of a moving standards picture. (opentelemetry.io)

### Where should operators look next if they want to implement this now?

vLLM’s production metrics page documents the `/metrics` endpoint and the metric names exposed by the OpenAI-compatible server, while its OpenTelemetry proof-of-concept page shows how to export traces to Jaeger. Dash0’s May 5 guide adds an end-to-end example with an OTel Collector and a traced FastAPI RAG application. (docs.vllm.ai) (docs.vllm.ai)
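
As a starting point, the signals discussed above can be read straight out of the Prometheus text that `/metrics` returns. The following is a minimal sketch in Python, not code from Dash0 or the vLLM docs. It assumes the metric names carry a `vllm:` prefix and that counters end in `_total`, which should be checked against the deployed vLLM version. The sample scrape and all of its values are invented for illustration, and `parse_metrics` and `summarize` are hypothetical helper names.

```python
import re

# Invented sample of a /metrics scrape. Values are illustrative only, and the
# metric names are assumptions to verify against your vLLM version.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 12.0
vllm:num_requests_waiting{model_name="demo"} 5.0
vllm:kv_cache_usage_perc{model_name="demo"} 0.5
vllm:prefix_cache_queries_total{model_name="demo"} 2000.0
vllm:prefix_cache_hits_total{model_name="demo"} 500.0
vllm:num_preemptions_total{model_name="demo"} 3.0
vllm:time_to_first_token_seconds_sum{model_name="demo"} 50.0
vllm:time_to_first_token_seconds_count{model_name="demo"} 200.0
"""

# Metric name, optional {labels}, then the sample value.
# Simplification: label values containing "}" are not handled.
LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)')


def parse_metrics(text):
    """Map metric name to value, skipping comments.

    Assumes one series per metric name (single model, single engine);
    with several series, the last one seen wins.
    """
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match is None:
            continue
        values[match.group(1)] = float(match.group(2))
    return values


def summarize(m):
    """Derive queue depth, cache pressure, hit rate and mean TTFT."""
    queries = m.get("vllm:prefix_cache_queries_total", 0.0)
    hits = m.get("vllm:prefix_cache_hits_total", 0.0)
    ttft_sum = m.get("vllm:time_to_first_token_seconds_sum", 0.0)
    ttft_count = m.get("vllm:time_to_first_token_seconds_count", 0.0)
    return {
        "running": m.get("vllm:num_requests_running", 0.0),
        "waiting": m.get("vllm:num_requests_waiting", 0.0),
        "kv_cache_usage": m.get("vllm:kv_cache_usage_perc", 0.0),
        "preemptions": m.get("vllm:num_preemptions_total", 0.0),
        # Ratios are None when there is no traffic to divide by.
        "prefix_hit_rate": hits / queries if queries else None,
        "mean_ttft_s": ttft_sum / ttft_count if ttft_count else None,
    }


summary = summarize(parse_metrics(SAMPLE))
for key, value in summary.items():
    print(f"{key}: {value}")
```

The ratios here are lifetime averages since the server started, which hide recent changes. In a real deployment the same metrics would more usefully be queried over time windows, for example with PromQL's `rate()` on the counters and `histogram_quantile()` on the TTFT buckets.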