Cleaning is most of the work

Practical analytics discussion on social media keeps returning to one point: most of data analysis is cleaning. One post framed it as roughly 80% of the effort and urged analysts to frame the question before building dashboards. (x.com) The same conversations treat mastering SQL and getting comfortable with messy data as gating skills for producing reliable decision support. (x.com)

The most repeated line in practical analytics right now is also the least glamorous one: the job is mostly cleaning. A social post that ricocheted through analyst circles this week put a number on it, about 80 percent of the work, and paired it with a sharper warning: if you do not define the question first, the dashboard is just decoration. That claim sounds like internet exaggeration until you look at the history of the field. For years, surveys of data scientists have found that cleaning and organizing data eats more time than modeling or visualization, including a widely cited 2016 CrowdFlower report in which three out of five respondents said cleaning and organizing data took up most of their time. More recent industry reports still describe the same bottleneck, even as AI tools promise to automate parts of it. (www2.cs.uh.edu)

That persistence matters because “cleaning” does not mean wiping away a few typos. It means reconciling systems that were never designed to agree. One table says “CA,” another says “California,” a third logs free-text addresses, and a fourth records dates in a different format. IDs go missing. Rows are duplicated. One team counts orders when they are placed, another when they ship, and a third when payment clears. By the time an analyst opens a dashboard tool, the real work is often deciding what a customer, a sale, or an active user even is. Humanitarian data-cleaning guidance says messy data is full of inconsistencies caused by human error, bad recording systems, and imported formats that do not match. That is not an edge case. It is the normal state of operational data. (mandeguidelines.iom.int)

That is why the advice to “frame the question first” is more than a workflow preference. A question determines grain, timeframe, and definitions. If the question is churn, you need a defensible definition of when a customer became inactive. If the question is campaign performance, you need to decide whether revenue belongs to click date, purchase date, or fulfillment date. Without that step, analysts can build polished charts that answer the wrong question with great precision. The dashboard looks finished because the software makes it easy to make something legible. It is much harder to make something true.

Once the question is fixed, the skill that keeps coming up is SQL. That is not nostalgia. SQL remains the standard language for relational databases, and the ordinary tasks of cleaning depend on exactly the things SQL is built to do: joins to reconcile sources, aggregate functions to check totals, string functions to standardize text, and window functions to identify duplicates or rank records within a customer history. Official database documentation still reads like a catalog of data-cleaning chores because that is what analysts do all day. (docs.oracle.com)

The social conversation has the tone of a pep talk, but underneath it is a gatekeeping truth. Reliable decision support does not come from being good at charts. It comes from being hard to fool. Analysts who are comfortable with messy data learn to distrust convenient numbers, trace fields back to source systems, and test whether a result survives basic sanity checks. That is why beginner portfolios full of immaculate Kaggle-style tables often miss the point. Real work starts when the columns are mislabeled, the timestamps drift, and the revenue total changes after a backfill.

Even the new AI boom has not erased that fact. Anaconda’s 2024 state-of-data-science report found rising use of AI for data cleaning and task automation, which suggests the pain is large enough that teams will throw new tools at it. But automation changes the speed of cleaning more easily than it changes the need for judgment. A model can suggest a match between two records. It cannot decide what the business should mean by “customer” if three departments use the word three different ways. (anaconda.com)
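
To make the SQL chores named in this piece concrete (string functions to standardize text, a window function to drop duplicates, a join and aggregates to sanity-check totals), here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is an illustrative assumption rather than anything taken from the sources cited here: the tables crm_customers and orders, their columns, the sample rows, and the rule that the most recently updated record wins. It also assumes a SQLite build recent enough to support window functions (3.25 or later).

```python
import sqlite3

# Hypothetical sample data: table names, columns, and rows are made up
# purely to illustrate the cleaning steps.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE crm_customers (customer_id INTEGER, name TEXT, state TEXT, updated_at TEXT);
INSERT INTO crm_customers VALUES
  (1, 'Ada Park',  'CA',           '2024-01-05'),
  (1, 'Ada Park',  'California',   '2024-03-01'),  -- duplicate customer row
  (2, 'Ben Ortiz', ' california ', '2024-02-10'),  -- stray spaces and casing
  (3, 'Cy Young',  'NY',           '2024-02-11');

CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
INSERT INTO orders VALUES
  (10, 1, 40.0),
  (11, 2, 25.0),
  (12, 3, 15.0),
  (13, 4, 20.0);  -- customer 4 is missing from the CRM table
""")

# Step 1: standardize state text, then keep one row per customer.
# Assumed rule: the most recently updated record is the one to keep.
CLEAN_SQL = """
WITH standardized AS (
  SELECT customer_id,
         name,
         CASE UPPER(TRIM(state))
           WHEN 'CALIFORNIA' THEN 'CA'
           WHEN 'NEW YORK'   THEN 'NY'
           ELSE UPPER(TRIM(state))
         END AS state,
         updated_at
  FROM crm_customers
),
ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY customer_id ORDER BY updated_at DESC
         ) AS rn
  FROM standardized
)
SELECT customer_id, name, state
FROM ranked
WHERE rn = 1
ORDER BY customer_id
"""
clean_rows = conn.execute(CLEAN_SQL).fetchall()

conn.execute("CREATE TABLE clean_customers (customer_id INTEGER, name TEXT, state TEXT)")
conn.executemany("INSERT INTO clean_customers VALUES (?, ?, ?)", clean_rows)

# Step 2: find orders whose customer ID does not exist in the cleaned table.
orphans = conn.execute("""
SELECT o.order_id
FROM orders o
LEFT JOIN clean_customers c ON c.customer_id = o.customer_id
WHERE c.customer_id IS NULL
ORDER BY o.order_id
""").fetchall()

# Step 3: compare revenue totals before and after the join.
raw_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
naive_total = conn.execute("""
SELECT SUM(o.amount)
FROM orders o
JOIN crm_customers c ON c.customer_id = o.customer_id
""").fetchone()[0]  # joins against the uncleaned table, duplicates included
joined_total = conn.execute("""
SELECT SUM(o.amount)
FROM orders o
JOIN clean_customers c ON c.customer_id = o.customer_id
""").fetchone()[0]

print("clean customers:", clean_rows)
print("orders with no matching customer:", orphans)
print("totals (raw, naive join, clean join):", raw_total, naive_total, joined_total)
```

With this toy data the three totals disagree: joining against the uncleaned customer table counts one order twice, while joining against the cleaned table silently drops the order with the missing customer. Which of those numbers counts as "revenue" is the kind of definitional question the article argues has to be settled before any chart is built.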

Shared from Scout - Be the smartest in the room.