300+ ML system designs compiled

A large public repo collects 300+ real ML system designs from Google, Amazon, Microsoft, Netflix, and about 80 other companies, offering a rare view into production-scale architecture choices and tradeoffs. The collection is useful for engineering leaders benchmarking design patterns, failure modes, and operational decisions. (x.com)

A GitHub repository that quietly gathered steam over the past year has become a kind of field guide to modern machine learning in production. The project, called “A Curated List of ML System Design Case Studies,” pulls together more than 300 writeups from over 80 companies, including Google, Amazon, Microsoft, Netflix, Airbnb, Uber, and DoorDash. On GitHub, it has climbed to about 10,000 stars and roughly 1,500 forks, which is a lot for what is essentially a reading list. (github.com)

That popularity makes sense once you see what the repository actually is. It is not a pile of toy demos. It is a catalog of engineering blog posts, papers, and technical writeups about real systems that companies say they run in production. The maintainer describes the collection as a set of case studies organized by company and use case, with material spanning recommendation, search and ranking, fraud detection, computer vision, and natural language processing. (github.com)

The important part is not the number 300. It is the word “curated.” Production ML is usually learned sideways. Engineers piece it together from conference talks, scattered blog posts, and tribal knowledge inside large companies. Even Chip Huyen’s long-running machine learning systems design materials make the same point: the best way to understand deployment constraints is to read case studies from teams that have already wrestled with them. (huyenchip.com)

That is why this repository feels more useful than another “system design interview” cheat sheet. It exposes the awkward middle layer that polished demos leave out. The linked case studies tend to describe not just a model, but the surrounding machinery: feature pipelines, online serving, latency budgets, evaluation criteria, rollout choices, and the business metric the system was supposed to move. Evidently AI, which maintains a larger related database, uses almost exactly those criteria for inclusion: the writeup has to describe an in-house system, used in production, with enough detail on product design, evaluation, and deployment architecture to be worth studying. (evidentlyai.com)

Seen that way, the repository is less a collection of breakthroughs than a map of recurring problems. Recommendation systems keep showing up. So do search ranking and fraud detection. Those categories dominate because they are where ML has to survive contact with users, money, and strict response times. The lesson is not that every company builds the same stack. It is that many of them run into the same constraints, then make different tradeoffs around speed, accuracy, observability, and operational complexity. (evidentlyai.com)

There is also a quieter reason the list landed so well. Public writing from big tech companies has become one of the few places where outsiders can still inspect how large-scale ML systems are actually assembled. The repository turns that scattered literature into something closer to a benchmark set for engineering judgment. A technical leader can scan how Airbnb framed ranking, how Netflix discussed streaming-quality prediction, or how Booking.com talked about the gap between model performance and business performance, then compare those choices against their own team’s habits. (huyenchip.com)

The repo is not complete, and it is not neutral. Any collection like this reflects what companies are willing to publish, which usually means successful systems and tidy retrospectives. But even that bias is informative. It shows which problems companies think are worth explaining in public, and which design patterns have become common enough to teach.

The maintainer’s latest visible update was about eight months ago, yet the project keeps circulating because the underlying need has not changed: people want examples of ML systems that survived the mess of production. On GitHub, that need is now sitting in a plain README with about 10,000 stars. (github.com)
