New practical MLOps resources
A set of practical MLOps posts surfaced: an end‑to‑end AWS deployment guide covering Docker, Terraform and CI/CD; a reminder that Netflix’s Metaflow powers thousands of internal projects and supports production scaling; and a three‑day live‑coding blueprint from a principal data scientist showing real system builds. (x.com) (x.com) (x.com) There was also a recent MLOps question focused on CI/CD for multi‑agent LLM apps, including prompt testing and canary rollouts. (x.com)
Machine learning operations (MLOps) is the work of turning a model into a service that can be tested, shipped, and rolled back, and a fresh batch of how-to material is now circulating around that job. (aws.amazon.com)
Amazon Web Services published a guide on October 8, 2025 that lays out an MLOps platform built with Terraform, GitHub, GitHub Actions, and Amazon SageMaker, including model registration and deployment to preproduction and production environments. The post describes a multi-account setup, workflows driven from code repositories, and automated continuous integration and continuous delivery (CI/CD) pipelines. (aws.amazon.com)
A May 3, 2025 tutorial by Karan Verma for Docker walks through the same packaging-and-infrastructure pattern from the ground up, starting with a containerized inference app and then using Terraform to provision Amazon Web Services compute. The example scales from one container to multiple agents by changing Terraform variables, the kind of step-by-step build many engineers look for when they move past notebook demos. (dev.to)
Netflix’s Metaflow covers a different part of the stack: it is a workflow framework for managing code, data, and compute across machine learning projects. Netflix wrote in a November 4, 2025 engineering post that Metaflow now powers a wide range of machine learning and artificial intelligence systems across the company, while Maestro serves as the orchestration backbone for nearly every machine learning and artificial intelligence system at Netflix. (netflixtechblog.com)
The Metaflow project site says Netflix open-sourced the framework in 2019 and that the tool now runs on Amazon Web Services, Microsoft Azure, Google Cloud, and Kubernetes. The site also says Metaflow is used by hundreds of companies beyond Netflix, which helps explain why posts about it keep resurfacing in MLOps discussions. (metaflow.org)
One recurring claim around Metaflow is scale. A 2022 conference talk description by former Netflix engineer Julie Amundson says the framework had been tested across thousands of machine learning projects since it was open-sourced in 2019, while a 2025 Netflix post describes company-wide use in broader terms without giving a project count. (gotopia.tech) (netflixtechblog.com)
The newer pressure point is large language model (LLM) systems that do more than answer a single prompt. A recent engineering guide on agentic systems says standard CI/CD assumptions break when software is non-deterministic, calls external tools, spawns sub-agents, and keeps state across sessions. (mindra.co)
That is why current MLOps threads are converging on practical mechanics instead of abstract platform talk: containers to freeze dependencies, Terraform to recreate infrastructure, workflow tools to track runs, and staged releases to catch failures before full rollout. The common thread is not a new framework but a demand for repeatable builds that survive contact with production traffic. (aws.amazon.com) (dev.to) (metaflow.org) (mindra.co)
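
As a rough illustration of the "staged releases" idea, the Python sketch below shows one way a canary gate might decide whether to promote or roll back a new version. It is not taken from any of the cited guides: the `StageStats` type, the `canary_decision` function, and the default thresholds (`min_requests`, `max_rate_increase`) are hypothetical names and values chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageStats:
    """Request counts observed for one deployment stage (hypothetical shape)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # A stage that has seen no traffic is treated as having a 0.0 error rate.
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(
    baseline: StageStats,
    canary: StageStats,
    min_requests: int = 500,          # assumed minimum sample before judging
    max_rate_increase: float = 0.01,  # assumed tolerance: +1 percentage point
) -> str:
    """Return "wait", "promote", or "rollback" for a canary stage."""
    # Too little canary traffic to judge: keep the current traffic split.
    if canary.requests < min_requests:
        return "wait"
    # Canary is failing noticeably more often than the stable version.
    if canary.error_rate > baseline.error_rate + max_rate_increase:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    baseline = StageStats(requests=10_000, errors=100)  # 1.0% error rate
    print(canary_decision(baseline, StageStats(requests=200, errors=0)))
    print(canary_decision(baseline, StageStats(requests=1_000, errors=15)))
    print(canary_decision(baseline, StageStats(requests=1_000, errors=40)))
```

In a real pipeline the counts would come from monitoring rather than hard-coded values, and a non-deterministic LLM application would likely need quality signals beyond a raw error rate. The point of the sketch is only that the promote-or-rollback rule is written down and repeatable.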