KV cache is the bottleneck — quote

“With vLLM and TensorRT-LLM, you can saturate your GPUs on prefill, but if your KV cache strategy isn’t bulletproof, you’re leaving latency improvements on the table,” said a panelist summarizing inference realities. Panels also pushed inference-graph partitioning and new controllers such as KubeLLM to cut costs. AWQ was called out as a low-latency quantization option for agent workloads, while QLoRA remains the default choice for fine-tuning accuracy trade-offs. (youtube.com) (x.com)

Panelists flagged KV-cache pressure as the recurring production limiter. vLLM emits preemption warnings when the KV cache runs out of space, and it recommends tuning gpu_memory_utilization and tensor/pipeline parallelism so that preempted requests do not have to restart their prefills. (github.com) NVIDIA’s TensorRT-LLM docs and recent research both show that KV cache reuse, offloading, and cache-loading I/O are the choke points that negate CPU/GPU prefill gains unless cache strategies (offload, prioritized eviction, paged loaders) are applied. (nvidia.github.io)

Speakers pushed inference-graph partitioning as a response: HexGen-2 and Berkeley work demonstrate graph- and phase-aware partitioning and dynamic partition switching to co-optimize compute, communication, and inter-phase KV cache movement. (iclr.cc)

Operational panels also recommended Kubernetes-aware controllers and model operators for cost control. The papers and projects cited as practical levers to lower operations and GPU spend were KubeLLM for automated Kubernetes troubleshooting, and llm-d/KServe and KubeAI for KV-cache-aware scheduling and disaggregated prefill/decode. (ieeexplore.ieee.org)

AWQ (Activation-aware Weight Quantization) was highlighted as the go-to low-latency quantization method for agent-style workloads because it reduces VRAM roughly 3–4× while preserving quality in many PTQ (post-training quantization) benchmarks. vLLM notes that AWQ’s current sweet spot is low-concurrency, low-latency inference. (github.com)

Panels treated QLoRA as the dominant production pattern for fine-tuning trade-offs: the original QLoRA method enables fine-tuning models of up to 65B parameters on a single 48GB GPU while preserving near-full 16-bit task performance, and recent community benchmarks still cite QLoRA as the accuracy-preserving, cost-efficient baseline. (arxiv.org)
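The KV-cache pressure and the VRAM savings from 4-bit weights discussed above can be made concrete with some back-of-the-envelope arithmetic. The sketch below is not taken from the panel or from any library: the model dimensions, the 24 GiB GPU, and the helper names are hypothetical, and the estimate ignores activations, quantization scales, and framework overhead. It only uses the standard sizing rule that each token stores one key and one value vector per layer per KV head.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model dimensions, chosen for illustration only."""
    num_params: float   # total number of weights
    num_layers: int
    num_kv_heads: int   # smaller than the attention head count under GQA
    head_dim: int


def kv_bytes_per_token(m: ModelShape, kv_dtype_bytes: int = 2) -> int:
    # One key vector and one value vector per layer per KV head.
    return 2 * m.num_layers * m.num_kv_heads * m.head_dim * kv_dtype_bytes


def weight_bytes(m: ModelShape, bits_per_weight: float) -> float:
    # Rough estimate: ignores quantization scales and zero-points.
    return m.num_params * bits_per_weight / 8


def kv_token_budget(m: ModelShape, gpu_bytes: float, memory_fraction: float,
                    bits_per_weight: float, kv_dtype_bytes: int = 2) -> int:
    # memory_fraction plays the role of a knob like gpu_memory_utilization:
    # the share of GPU memory the server may use for weights plus KV cache.
    usable = gpu_bytes * memory_fraction - weight_bytes(m, bits_per_weight)
    if usable <= 0:
        return 0
    return int(usable // kv_bytes_per_token(m, kv_dtype_bytes))


if __name__ == "__main__":
    shape = ModelShape(num_params=8e9, num_layers=32, num_kv_heads=8, head_dim=128)
    gpu = 24 * 1024**3  # assumed 24 GiB card
    for bits in (16, 4):
        tokens = kv_token_budget(shape, gpu, 0.9, bits)
        print(f"{bits}-bit weights: room for about {tokens} cached tokens")
```

Under these assumptions, moving the weights from 16-bit to 4-bit frees most of the card for KV cache, which is one way to read the panel's point that quantization and cache strategy are linked: the token budget, not raw compute, decides how many concurrent requests fit before preemption starts.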
