MLOps roadmap: infra to inference
A wave of engineering resources this week (Tech with Mak’s 2026 AI engineering roadmap, DevOpsCube’s GPU‑Kubernetes guides, and Manning’s platform engineering book) reiterated the same stack: CI/CD, containerized GPU workloads, quantization, and caching as the path from prototype to reliable inference. The guidance centers on operationalizing RAG and agents and on making GPU costs predictable. (x.com/techNmak/status/2034946111694131639) (x.com/devopscube/status/2035204783728968069) (x.com/ManningBooks/status/2034993617362644998)
In a step‑by‑step guide published March 4–9, 2026, DevOpsCube walks through installing the NVIDIA GPU Operator on Kubernetes, verifying that GPUs are visible on the nodes, and deploying a validation LLM workload using Ollama. (devopscube.com)
Manning’s Effective Platform Engineering, co‑authored by Ajay Chankramath and others and released in October 2025, dedicates chapters to building secure, Kubernetes‑based developer platforms, with explicit guidance on service‑level objectives and modern control‑plane patterns. (manning.com)
vLLM’s engineering posts and community writeups describe hierarchical KV‑cache and prefix‑caching techniques that page KV blocks between GPU, CPU, and remote stores to cut time‑to‑first‑token and sustain throughput. vLLM benchmarks and Microsoft commentary report throughput gains of up to ~24× over vanilla Hugging Face Transformers in some workloads. (vllm.ai)
QLoRA/LoRA workflows now make fine‑tuning very large models feasible on single‑GPU rigs: QLoRA papers and guides show that 4‑bit quantization plus LoRA can fine‑tune 65B‑class models on a 48‑GB GPU, and they report memory savings on the order of ~4× (up to ~75% VRAM reduction) versus full‑precision training. (clay‑atlas.com)
NVIDIA’s TensorRT‑LLM repo and docs provide production recipes for INT8/FP8 quantization, multi‑GPU parallelism, and KV‑cache exchange, while multiple community benchmarks show that speculative decoding can add ~1.4–3× token‑generation speedups, depending on model pairings and acceptance rates. (nvidia.com) (nvidia.github.io)
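The QLoRA memory figures quoted above can be sanity‑checked with back‑of‑envelope arithmetic. The sketch below is illustrative only and rests on assumptions that are not in the cited sources: it counts weight storage alone (no activations, optimizer state, LoRA adapter weights, or quantization constants), it treats the baseline as 16‑bit weights, and the helper name weight_memory_gb is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes).

    Assumption: ignores activations, optimizer state, LoRA adapters,
    and the small overhead of quantization constants.
    """
    return num_params * bits_per_param / 8 / 1e9


# A 65B-parameter model, matching the "65B-class" figure in the text.
params = 65e9

# Assumed baseline: 16-bit weights (2 bytes per parameter).
fp16_gb = weight_memory_gb(params, 16)

# 4-bit quantized base weights (0.5 bytes per parameter).
q4_gb = weight_memory_gb(params, 4)

ratio = fp16_gb / q4_gb          # how many times smaller the 4-bit weights are
reduction = 1 - q4_gb / fp16_gb  # fractional reduction in weight memory
fits_48gb = q4_gb < 48           # whether the weights alone fit on a 48-GB GPU

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {q4_gb:.1f} GB")
print(f"ratio: {ratio:.1f}x, reduction: {reduction:.0%}, fits in 48 GB: {fits_48gb}")
```

Under these assumptions, the 4‑bit base weights take about a quarter of the 16‑bit footprint, which is consistent with the ~4× and ~75% figures, and they leave headroom on a 48‑GB card for adapters and activations. Real memory use depends on batch size, sequence length, and the training setup, so treat this as a rough lower bound rather than a capacity plan.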