AI cloud stack anatomy
A detailed breakdown describes the modern AI cloud stack as GPUs/TPUs for heavy compute, InfiniBand-class networking, data lakes for storage, and Kubernetes/MLOps for orchestration. Together these form the backbone that enables scalable LLM and autonomy workloads. The breakdown names AWS SageMaker, GCP Vertex AI and CoreWeave as concrete platform examples. (x.com)
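To give a rough sense of why the networking layer matters for distributed training, the sketch below estimates the bandwidth-only cost of one ring all-reduce at the 200 Gb/s (HDR) and 400 Gb/s (NDR) InfiniBand line rates cited in this section. It is a back-of-envelope model, not a benchmark: the 7-billion-parameter model, 2-byte gradients, 16 nodes, and one port per node are all assumptions chosen for illustration, and latency, protocol overhead, and compute/communication overlap are ignored.

```python
def ring_allreduce_seconds(payload_bytes: float, num_nodes: int, link_gbps: float) -> float:
    """Idealized ring all-reduce time: bandwidth term only, no latency or overlap."""
    if num_nodes < 2:
        return 0.0  # nothing to exchange with a single node
    link_bytes_per_s = link_gbps * 1e9 / 8  # convert Gb/s to bytes/s
    # In a ring all-reduce each node sends (and receives) 2*(N-1)/N of the payload.
    traffic_bytes = 2 * (num_nodes - 1) / num_nodes * payload_bytes
    return traffic_bytes / link_bytes_per_s


# Hypothetical workload (assumption): 7e9 parameters, 2-byte gradients, 16 nodes,
# one InfiniBand port per node.
GRAD_BYTES = 7e9 * 2
NODES = 16

for name, gbps in (("HDR", 200.0), ("NDR", 400.0)):
    seconds = ring_allreduce_seconds(GRAD_BYTES, NODES, gbps)
    print(f"{name} at {gbps:.0f} Gb/s: about {seconds:.2f} s per all-reduce")
```

Under this simplified model the transfer time scales inversely with link rate, so doubling the line rate halves the estimate. Real clusters typically attach several ports per node and overlap communication with compute, so measured numbers will differ.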
CoreWeave advertises NVIDIA HGX H100 and H200 capacity, with product claims on its platform pages of up to 9x faster training and up to 30x faster inference versus HGX A100. (coreweave.com) The company reported operating more than 250,000 NVIDIA GPUs across roughly 32 data centers and has struck multi-billion-dollar infrastructure deals, including an OpenAI agreement with a potential value of up to $11.9 billion. (techcrunch.com)
AWS offers EC2 P5 instances powered by NVIDIA H100 GPUs, alongside P5e instances built on H200 GPUs, and added ml.p5 support to SageMaker training jobs so customers can run H100-based training inside SageMaker. (aws.amazon.com) SageMaker surfaces MLOps primitives (SageMaker Pipelines, Studio and model monitoring), and AWS provides Elastic Fabric Adapter (EFA) networking to enable low-latency RDMA for scaled distributed training. (docs.aws.amazon.com)
Google's Vertex AI is integrated with Cloud TPU infrastructure, including TPU v5p pods (8,960 chips per pod), and is presented as a component of Google's AI Hypercomputer architecture for large-scale training. (docs.cloud.google.com) Vertex AI Pipelines is a serverless, Kubeflow-based orchestration service for production MLOps that supports building, compiling and running DAG-style ML workflows on GCP. (docs.cloud.google.com)
AI clusters increasingly rely on InfiniBand NDR (400 Gb/s) fabrics, which double the per-port bandwidth of HDR (200 Gb/s) and enable in-network compute and collective-operation offload via NVIDIA Quantum-2 switches. (docs.nvidia.com)
Object data lakes remain the canonical store for training data and artifacts: SageMaker maps training datasets and checkpoints to Amazon S3 (with FSx and EFS as options), while Vertex AI recommends Cloud Storage buckets and Cloud Storage FUSE for mounted access to multi-TB datasets. (docs.aws.amazon.com)
Kubernetes now includes stable GPU scheduling via device plugins, and the NVIDIA GPU Operator automates the installation of drivers, device plugins and runtimes to provision GPUs at scale inside clusters. (kubernetes.io)
Commercial AI-focused clouds position themselves around these pieces: CoreWeave pitches a Kubernetes-native developer experience and curated H100/H200 capacity, AWS ties SageMaker to EFA and P5 instances, and GCP ties Vertex AI to TPU pods and serverless pipeline orchestration. (computeprices.com)
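
As an illustration of the device-plugin scheduling mentioned above, here is a minimal sketch that builds a Pod manifest requesting whole GPUs through the `nvidia.com/gpu` extended resource, which is the resource name the NVIDIA device plugin advertises. The pod name, container name, and image are hypothetical placeholders, and the sketch assumes a cluster where the device plugin (for example via the GPU Operator) is already installed.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int) -> dict:
    """Build a minimal Pod spec that requests whole GPUs via nvidia.com/gpu."""
    if gpus < 1:
        raise ValueError("gpus must be a positive integer")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",  # hypothetical container name
                    "image": image,
                    # GPUs are requested under limits and cannot be fractional
                    # or overcommitted; the scheduler only places the Pod on a
                    # node that advertises enough nvidia.com/gpu capacity.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical names and image (assumptions for illustration only).
manifest = gpu_pod_manifest("train-demo", "registry.example.com/trainer:latest", 8)
print(json.dumps(manifest, indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the printed output could be saved to a file and submitted with `kubectl apply -f`. A production job would more likely use a Job or a training operator than a bare Pod.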