Kubernetes is table stakes
A recent thread and tutorials reinforce that Docker+Kubernetes pipelines, autoscaling, and infra‑as‑code are now baseline for shipping RAG systems quickly and cost‑efficiently, while ClickHouse's ClickStack lessons underscore Kubernetes tradeoffs and the headaches of a 1 PB S3 cleanup. The practical upshot: standardize Helm charts and CI, tune GPU node affinity and autoscaling, and automate infra to cut operational debt. (x.com) (x.com) (youtube.com)
Recent how‑tos and vendor blueprints treat containerized Docker+Kubernetes pipelines as the default deployment pattern for RAG stacks: NVIDIA’s RAG blueprint and multiple engineering guides lay out Kubernetes-based architectures that combine retrieval, a vector DB, and LLM inference. (developer.nvidia.com) Production guides and the Kubernetes docs show that GPU scheduling requires explicit node labeling, nodeAffinity, and taints and tolerations, plus the NVIDIA GPU Operator and device plugins, to avoid idle GPUs and topology mismatches. (kubernetes.io)
In practice, autoscaling inference relies on KEDA or custom metrics together with the cluster autoscaler and “cost circuit breakers”, scaling on queue depth or GPU utilization rather than raw HTTP RPS, as recent posts and cloud vendor docs recommend. (markaicode.com) Standardizing Helm charts and pushing deployments through CI and GitOps (Argo CD or Flux), while provisioning clusters with Terraform, is the common playbook cited in multiple CI/CD and GitOps walkthroughs. (dev.to)
ClickHouse’s ClickStack launch explicitly targets petabyte-scale observability on ClickHouse, but ClickHouse docs and community posts warn that S3-backed MergeTree setups can produce orphaned S3 objects that are complicated to clean up at scale. (clickhouse.com) Community issue threads and engineering notes show that orphaned S3 blobs can persist after crashes or replica loss, and that ClickHouse currently relies on lifecycle policies or external reconciliation tooling to reclaim those objects. (github.com)
Operationally: codify Helm charts in CI pipelines, use GitOps (Argo CD) plus Terraform for infra changes, tune GPU scheduling with MIG and nodeAffinity, autoscale with KEDA, and, if you run ClickHouse on S3, add automated S3 reconciliation or lifecycle rules to avoid unexpected petabyte‑scale bills. (terrateam.io)
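To make the queue-depth autoscaling and “cost circuit breaker” idea concrete, here is a minimal Python sketch of the arithmetic involved. It is an illustration only, not KEDA's or the Kubernetes autoscaler's code: the `ScalingPolicy` fields, the `desired_replicas` function, and the example prices are all hypothetical names and numbers chosen for this sketch. It assumes one GPU per replica and a flat hourly price, and it mirrors only roughly how a queue-length target translates into a replica count.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical policy knobs for this sketch (not a KEDA or Kubernetes API)."""
    target_queue_per_replica: int  # backlog one replica is expected to absorb
    min_replicas: int              # floor, kept even when the queue is empty
    max_replicas: int              # hard ceiling set by the operator
    gpu_hourly_cost: float         # assumed flat price of one GPU replica per hour
    hourly_budget: float           # spend ceiling acting as the cost circuit breaker


def desired_replicas(queue_depth: int, policy: ScalingPolicy) -> int:
    """Return a replica count driven by queue depth and capped by budget."""
    if queue_depth < 0:
        raise ValueError("queue_depth must be non-negative")

    # Demand signal: enough replicas that each one faces at most the target backlog.
    by_queue = math.ceil(queue_depth / policy.target_queue_per_replica)

    # Cost circuit breaker: how many replicas the hourly budget can pay for.
    by_budget = math.floor(policy.hourly_budget / policy.gpu_hourly_cost)

    # The effective ceiling is whichever limit is tighter.
    ceiling = min(policy.max_replicas, by_budget)

    # Clamp demand to the ceiling, then apply the floor. In this sketch the
    # floor wins if the budget ceiling is ever lower than min_replicas.
    return max(policy.min_replicas, min(by_queue, ceiling))


if __name__ == "__main__":
    policy = ScalingPolicy(
        target_queue_per_replica=20,
        min_replicas=1,
        max_replicas=16,
        gpu_hourly_cost=2.5,
        hourly_budget=20.0,
    )
    for depth in (0, 45, 1000):
        print(depth, desired_replicas(depth, policy))
```

With these example numbers the budget allows 8 replicas, so a queue of 1000 requests is capped at 8 rather than the 16 that `max_replicas` alone would permit. In a real cluster the same intent would be expressed through the autoscaler's own configuration instead of application code.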