KV cache is the bottleneck — quote
“With vLLM and TensorRT-LLM, you can saturate your GPUs on prefill, but if your KV cache strategy isn’t bulletproof, you’re leaving latency improvements on the table,” said a panelist summarizing inference realities. Panels also pushed inference-graph partitioning and new controllers such as KubeLLM to cut costs. AWQ was called out as a low-latency quantization method for agent workloads, while QLoRA still rules fine-tuning accuracy trade-offs. (youtube.com) (x.com)
Panels at the session flagged KV-cache pressure as the recurring production limiter. vLLM emits preemption warnings when the KV cache runs out, and recommends tuning gpu_memory_utilization and tensor/pipeline parallelism so that preempted requests do not have to restart their prefills. (github.com) NVIDIA’s TensorRT-LLM docs and recent research both identify KV cache reuse, offloading, and load I/O as the choke points that negate prefill gains unless a cache strategy (offload, prioritized eviction, paged loaders) is applied. (nvidia.github.io)

Speakers pushed inference-graph partitioning as a response: HexGen-2 and Berkeley work demonstrate graph- and phase-aware partitioning with dynamic partition switching to co-optimize compute, communication, and inter-phase KV cache movement. (iclr.cc)

Operational panels also recommended Kubernetes-aware controllers and model operators for cost control. The papers and projects cited as practical levers for lowering ops and GPU spend were KubeLLM (automated Kubernetes troubleshooting) and llm-d/KServe and KubeAI (KV-cache-aware scheduling and disaggregated prefill/decode). (ieeexplore.ieee.org)

AWQ (Activation-aware Weight Quantization) was highlighted as the go-to low-latency quantization method for agent-style workloads because it reduces VRAM use by roughly 3–4× while preserving quality in many PTQ benchmarks. vLLM notes that AWQ’s current sweet spot is low-concurrency, low-latency inference. (github.com)

Panels left QLoRA as the dominant production pattern for fine-tuning trade-offs: the original QLoRA method enables fine-tuning models of up to 65B parameters on a single 48GB GPU while preserving near-full 16-bit task performance, and recent community benchmarks still cite QLoRA as the accuracy-preserving, cost-efficient baseline. (arxiv.org)
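To make the KV-cache pressure and the AWQ memory savings discussed above more concrete, here is a rough back-of-envelope sketch in Python. It is not taken from the panel or from any library: the `ModelShape` class, the helper names, and the example model dimensions are all hypothetical, and the arithmetic ignores activations, quantization scales, and runtime overhead. The only name borrowed from the text is `gpu_memory_utilization`, treated here simply as the fraction of GPU memory the server is allowed to use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer dimensions (illustrative, not from the source)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    num_params: float  # total parameter count


def kv_bytes_per_token(shape: ModelShape, dtype_bytes: int = 2) -> int:
    # Every layer stores one key vector and one value vector per KV head,
    # hence the leading factor of 2.
    return 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * dtype_bytes


def weight_bytes(shape: ModelShape, bits_per_weight: float) -> float:
    # Idealized weight footprint; real 4-bit formats also store scales and
    # zero points, so actual savings are somewhat below 4x.
    return shape.num_params * bits_per_weight / 8


def max_cached_tokens(
    gpu_bytes: float,
    gpu_memory_utilization: float,
    shape: ModelShape,
    bits_per_weight: float,
    dtype_bytes: int = 2,
) -> int:
    # Assumption: whatever remains after loading weights is KV cache budget.
    budget = gpu_bytes * gpu_memory_utilization - weight_bytes(shape, bits_per_weight)
    if budget <= 0:
        return 0
    return int(budget // kv_bytes_per_token(shape, dtype_bytes))


if __name__ == "__main__":
    # Assumed 7B-class model shape and a 24 GiB GPU, chosen only for illustration.
    shape = ModelShape(num_layers=32, num_kv_heads=32, head_dim=128, num_params=7e9)
    gpu = 24 * 1024**3
    for bits in (16, 4):
        tokens = max_cached_tokens(gpu, 0.90, shape, bits)
        print(f"{bits}-bit weights: room for about {tokens} cached tokens")
```

Under these assumptions, shrinking the weights or raising the memory fraction both enlarge the token budget available to the KV cache, which is one way to read the panel's point that cache capacity, not prefill compute, tends to be the limiter.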