ML infra reality: platforms and tooling
Experts are highlighting a small set of deployment platforms (Databricks for data-heavy pipelines, SageMaker/Vertex for enterprise scale, and Kubernetes for control) and stressing fundamentals beyond model training, such as feature stores and drift detection. Netflix also open-sourced Metaflow, which the company says supports thousands of ML projects with local-to-cloud scaling, versioning, and orchestration without large rewrites. (x.com, x.com)
Machine learning infrastructure is settling into a short list of defaults: Databricks for data-heavy pipelines, Amazon SageMaker and Vertex AI for managed enterprise deployments, and Kubernetes for teams that want tighter operational control. (docs.databricks.com, docs.aws.amazon.com, cloud.google.com, kubernetes.io)
A feature is a cleaned-up input a model can use, like a customer’s recent purchase count instead of a raw transaction log. Databricks, SageMaker, and Vertex AI all now document feature stores as central systems for creating, sharing, and serving those inputs in training and in live predictions. (docs.databricks.com, docs.aws.amazon.com, cloud.google.com)
Drift detection is the check that asks whether live data has started to look different from the data a model learned from. Google’s Vertex AI says its feature monitoring can schedule jobs, retrieve feature statistics, and detect drift, while Vertex AI Model Monitoring can alert when metrics cross thresholds. (cloud.google.com, cloud.google.com)
Amazon SageMaker makes the same case from a different angle: its documentation says feature stores simplify creating, storing, sharing, and managing features, and its feature processing tools turn raw batch data into reusable model inputs. Databricks describes its online feature store as a low-latency system for real-time models and says model serving can automatically look up precomputed features at inference time. (docs.aws.amazon.com, docs.aws.amazon.com, docs.databricks.com)
Kubernetes sits lower in the stack. The project defines itself as an open-source system for automating deployment, scaling, and management of containerized applications, and its production guide says real clusters need secure access, availability, and capacity planning beyond a test setup. (kubernetes.io, kubernetes.io) That is why teams often split by tradeoff instead of chasing one universal platform. Managed services handle more of the plumbing, while Kubernetes gives operators more say over how workloads are packaged, scheduled, and run. (docs.aws.amazon.com, cloud.google.com, kubernetes.io)
Netflix’s Metaflow fits into that picture as workflow software rather than a full cloud platform. Metaflow’s documentation says it was built to help data science, artificial intelligence, and machine learning projects move from local development to cloud scaling and production operation in Python. (docs.metaflow.org, docs.metaflow.org, docs.metaflow.org)
Netflix says it open-sourced Metaflow in 2019, and a company engineering post published about five months ago said the framework now powers a wide range of machine learning and artificial intelligence systems across Netflix and other companies. The public GitHub repository showed about 10,000 stars and roughly 1,300 forks when crawled this month. (netflixtechblog.com, github.com, github.com)
Metaflow’s pitch is less about replacing SageMaker, Vertex AI, or Kubernetes than about reducing rewrites between notebook experiments and scheduled jobs. Its scaling guide says flows can move to cloud compute with no code changes after infrastructure is configured, and its docs center versioning, debugging, and orchestration as first-class parts of the workflow. (docs.metaflow.org, docs.metaflow.org, docs.metaflow.org)
The practical message from the tooling stack is narrower than the hype around model training. The durable work is in data pipelines, reusable features, monitoring, and deployment paths that survive contact with production traffic. (docs.databricks.com, docs.aws.amazon.com, cloud.google.com, kubernetes.io)
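For readers who want to see the two ideas defined earlier in concrete form, here is a minimal Python sketch of a feature (a customer's recent purchase count derived from a raw transaction log) and a drift check on that feature. It is an illustration under stated assumptions, not how Databricks, SageMaker, or Vertex AI implement either one. The function names, the record layout, the 30-day window, and the bin edges are all hypothetical. The drift score used here is the Population Stability Index, one common choice among several, and the 0.2 alert threshold is a rule of thumb rather than a standard.

```python
import math
from datetime import datetime, timedelta


def recent_purchase_count(transactions, customer_id, as_of, window_days=30):
    """Feature: how many purchases one customer made in the trailing window.

    `transactions` is assumed to be a list of dicts with "customer_id" and
    "timestamp" keys (a hypothetical layout for this sketch).
    """
    start = as_of - timedelta(days=window_days)
    return sum(
        1
        for t in transactions
        if t["customer_id"] == customer_id and start < t["timestamp"] <= as_of
    )


def population_stability_index(reference, live, bin_edges, eps=1e-6):
    """Drift score comparing two samples of one numeric feature.

    Values are bucketed by `bin_edges`; the score is 0.0 when both samples
    have identical bucket shares and grows as the shares diverge.
    """
    def bucket_shares(values):
        if not values:
            raise ValueError("each sample needs at least one value")
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bucket index = number of edges at or below the value.
            counts[sum(1 for edge in bin_edges if v >= edge)] += 1
        # Floor empty buckets at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref = bucket_shares(reference)
    cur = bucket_shares(live)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Raw transaction log (made-up records for illustration only).
log = [
    {"customer_id": "c1", "timestamp": datetime(2024, 1, 5)},
    {"customer_id": "c1", "timestamp": datetime(2024, 1, 20)},
    {"customer_id": "c1", "timestamp": datetime(2023, 11, 1)},
    {"customer_id": "c2", "timestamp": datetime(2024, 1, 10)},
]
as_of = datetime(2024, 1, 31)
c1_count = recent_purchase_count(log, "c1", as_of)  # 2

# Feature values seen at training time vs. in live traffic (made up).
training_counts = [0, 1, 1, 2, 2, 3, 3, 4]
live_counts = [3, 4, 4, 5, 6, 6, 7, 8]
score = population_stability_index(training_counts, live_counts, [1, 3, 5])

ALERT_THRESHOLD = 0.2  # assumed rule-of-thumb cutoff, not a standard
drifted = score > ALERT_THRESHOLD
print(f"recent purchases for c1: {c1_count}, drift alert: {drifted}")
```

A feature store would hold the output of something like `recent_purchase_count` so training and serving read the same value, and a monitoring job would run a comparison like the one above on a schedule and alert when the score crosses the chosen threshold.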