Classic ML caveat: data leakage

A widely cited paper, “A Few Useful Things to Know About Machine Learning,” resurfaced this week as a warning to practitioners about common traps, most notably data leakage, where information from the test set inadvertently reaches the training data. (x.com) The paper’s checklist style is being promoted alongside modern teaching material as a quick guardrail for projects deploying models in production. (x.com)

Machine learning systems learn patterns from examples, and they fail in deployment when the examples quietly contain the answers. (homes.cs.washington.edu) That failure mode is called data leakage: information that would not be available at prediction time, including the test set or future data, slips into model building and makes results look better than they are. Scikit-learn’s documentation says leakage produces “overly optimistic performance estimates” and weaker real-world performance. (scikit-learn.org)

Pedro Domingos’ paper “A Few Useful Things to Know About Machine Learning,” published in *Communications of the ACM* in 2012, warned that if test data influences the classifier, measured accuracy can be inflated and the estimate “will be optimistically biased.” The paper has remained a standard reading-list item in university and industry machine learning courses. (homes.cs.washington.edu)

The basic fix is procedural, not magical: split data into training and test sets first, then fit every transformation only on the training side. Scikit-learn gives feature selection as a concrete example, showing that running it on all data before the split can create a model that appears far better than chance on random labels. (scikit-learn.org)

Modern teaching material frames the same problem in production terms. Google’s Machine Learning Crash Course includes “label leakage” and “training-serving skew” in its production monitoring guidance, alongside unit tests and data schemas meant to catch pipeline mistakes before deployment. (developers.google.com)

That is why an older checklist keeps resurfacing. Domingos’ paper was written as a compact field guide to recurring mistakes in applied machine learning, from overfitting and nonrepresentative data to the “winner’s curse” of trying many models and reporting the best result. (homes.cs.washington.edu) The warning lands differently in 2026 because many teams now ship models as software products, not just research demos.
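The split-first rule and the random-label check described above can be sketched without any library. This is a minimal illustration, not scikit-learn's own example: the mean-difference feature score, the nearest-centroid classifier, and every function name and parameter here are assumptions chosen to keep the code self-contained. The labels are coin flips, so any accuracy well above chance comes from leakage alone.

```python
import random


def make_random_data(n_samples, n_features, rng):
    """Pure-noise features with shuffled labels: there is nothing real to learn."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    rng.shuffle(y)
    return X, y


def select_top_features(X, y, k):
    """Rank features by absolute difference of class means and keep the top k."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]

    def score(j):
        mean_pos = sum(r[j] for r in pos) / len(pos)
        mean_neg = sum(r[j] for r in neg) / len(neg)
        return abs(mean_pos - mean_neg)

    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def centroid_accuracy(X_train, y_train, X_test, y_test, features):
    """Fit a nearest-centroid classifier on train rows, score it on test rows."""
    def centroid(label):
        rows = [r for r, t in zip(X_train, y_train) if t == label]
        return [sum(r[j] for r in rows) / len(rows) for j in features]

    c0, c1 = centroid(0), centroid(1)

    def predict(row):
        d0 = sum((row[j] - c) ** 2 for j, c in zip(features, c0))
        d1 = sum((row[j] - c) ** 2 for j, c in zip(features, c1))
        return 1 if d1 < d0 else 0

    hits = sum(predict(r) == t for r, t in zip(X_test, y_test))
    return hits / len(y_test)


def run_experiment(seed=0, n_samples=100, n_features=2000, k=20):
    rng = random.Random(seed)
    X, y = make_random_data(n_samples, n_features, rng)
    half = n_samples // 2
    X_train, y_train = X[:half], y[:half]
    X_test, y_test = X[half:], y[half:]

    # Leaky: feature selection sees every row, including the test rows.
    leaky = select_top_features(X, y, k)
    leaky_acc = centroid_accuracy(X_train, y_train, X_test, y_test, leaky)

    # Honest: feature selection sees only the training rows.
    honest = select_top_features(X_train, y_train, k)
    honest_acc = centroid_accuracy(X_train, y_train, X_test, y_test, honest)
    return leaky_acc, honest_acc


leaky_acc, honest_acc = run_experiment()
print(f"selection on all rows:      {leaky_acc:.2f}")
print(f"selection on training rows: {honest_acc:.2f}")
```

Both runs use the same classifier and the same test rows. The only difference is whether the selection step was allowed to see the test rows, which is the step that the split-first rule forbids.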
Google’s current machine learning curriculum explicitly separates “problem framing,” “production ML systems,” and monitoring, reflecting a workflow where data handling errors can survive all the way into live systems. (developers.google.com, developers.google.com)

Leakage is often mundane. A preprocessing step may normalize using the full dataset, a feature may encode information only known after the prediction date, or a cross-validation loop may let the same person, transaction, or document family appear on both sides of the split. (scikit-learn.org)

The reason the advice travels so well is that it is tool-agnostic. Whether a team uses scikit-learn pipelines, Google’s course material, or a custom stack, the rule is the same: the model can only learn from information it would truly have at prediction time. (scikit-learn.org, developers.google.com)

So the old caveat keeps coming back with new packaging. If the test set has already leaked into training, the model is not proving it can generalize; it is proving it has already seen the exam. (homes.cs.washington.edu, scikit-learn.org)
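Two of the mundane leakage paths mentioned above, a group that straddles the split and a feature recorded after the prediction date, can be guarded with small checks. The following is a hedged sketch under stated assumptions: the record layout, the field names (`customer_id`, `basket_total`, `refund_issued`), and both helper functions are hypothetical and only illustrate the rule that the model should see nothing it would lack at prediction time.

```python
import random
from datetime import date


def group_split(records, group_key, test_fraction=0.25, seed=0):
    """Split so that each group lands entirely in train or entirely in test."""
    groups = sorted({r[group_key] for r in records})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in records if r[group_key] not in test_groups]
    test = [r for r in records if r[group_key] in test_groups]
    return train, test


def features_known_by(record, prediction_date):
    """Keep only feature values recorded on or before the prediction date."""
    return {
        name: value
        for name, (value, recorded_on) in record["features"].items()
        if recorded_on <= prediction_date
    }


# Hypothetical records: each feature is stored as (value, date it was recorded).
records = [
    {"customer_id": "c1", "features": {"basket_total": (40.0, date(2026, 1, 5)),
                                       "refund_issued": (1, date(2026, 2, 1))}},
    {"customer_id": "c1", "features": {"basket_total": (12.5, date(2026, 1, 7))}},
    {"customer_id": "c2", "features": {"basket_total": (88.0, date(2026, 1, 6))}},
    {"customer_id": "c3", "features": {"basket_total": (19.9, date(2026, 1, 8))}},
    {"customer_id": "c4", "features": {"basket_total": (5.0, date(2026, 1, 9))}},
]

train, test = group_split(records, "customer_id")
train_ids = {r["customer_id"] for r in train}
test_ids = {r["customer_id"] for r in test}
print("groups on both sides:", train_ids & test_ids)

# The refund was recorded after the prediction date, so it is dropped.
visible = features_known_by(records[0], date(2026, 1, 10))
print(visible)  # {'basket_total': 40.0}
```

Splitting on the group identifier rather than on individual rows is the design choice that matters here: a row-level shuffle would scatter the two `c1` records across both sides.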
