Core data skills still matter

Zach Wilson, a founder with big‑data experience at Meta, Netflix and Airbnb, argued on social media that foundational data engineering skills (Spark, Airflow, data quality, SQL and Python) remain timeless despite rapid change. The thread pushed back on the idea that tooling churn makes those fundamentals obsolete and sparked discussion about modern stack essentials. (x.com)

A veteran data engineer’s latest argument is simple: the tools keep changing, but the core skills still run the job. (dataexpert.io) Zach Wilson, founder of DataExpert.io, says he led data engineering and software teams at Airbnb, Facebook and Netflix, and his public posts and courses keep returning to the same stack: Structured Query Language (SQL), Python, data modeling, data quality, orchestration and distributed compute. (dataexpert.io) His GitHub profile lists a learning roadmap in this order: SQL, data modeling, Python, data quality, distributed compute, orchestration and big-data tools. It frames his approach as business-first rather than tool-first. (github.com)

The underlying work is less glamorous than the current artificial intelligence cycle makes it sound. Data engineering is the job of moving, cleaning and checking data so analysts, applications and machine-learning systems can use it reliably. (spark.apache.org) (airflow.apache.org) That is why the names in Wilson’s list are so ordinary. Apache Spark is a large-scale processing engine, Apache Airflow is workflow software for scheduling and monitoring batch jobs, Python is the general-purpose language many teams use to write pipeline code, and SQL is the language used to query and shape tables. (spark.apache.org) (airflow.apache.org) (docs.python.org) (postgresql.org)

The “data quality” part is not one product. It is the practice of testing whether records are complete, unique, valid and linked correctly before bad data spreads into dashboards, experiments or models. (greatexpectations.io) (docs.getdbt.com) Airflow’s own documentation shows teams wiring quality checks directly into pipelines, including SQL checks that can stop a workflow when the data fails a rule. (astronomer.io) (airflow.apache.org)

Spark and Airflow also remain current software, not museum pieces. Apache Spark’s latest documentation describes version 4.1.1, and Apache Airflow’s stable documentation describes version 3.2.0. (spark.apache.org) (airflow.apache.org)

Wilson’s broader point has landed in a market full of new labels (modern data stack, analytics engineering, data products, artificial intelligence engineering) that often package the same old requirements under new interfaces. Even courses marketed around open-source data engineering still recommend basic Python and foundational SQL before students touch Spark, dbt or Airflow. (coursera.org) The argument is not that newer tools do not matter. It is that teams still need people who can read tables, write code, schedule jobs and catch broken data before it reaches the rest of the company. (github.com) (greatexpectations.io)
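
To make the data quality practice described above concrete, here is a minimal sketch of SQL checks that halt a pipeline step when records are incomplete, duplicated, invalid or unlinked. It is an illustration and not taken from Wilson’s material or from Airflow’s documentation: the `orders` and `customers` tables, the four rules, and the `DataQualityError`, `run_checks`, `gate` and `build_demo_db` names are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Hypothetical rules. Each query counts the rows that VIOLATE the rule,
# so a result of 0 means the rule passes.
CHECKS = {
    "complete: orders.amount is not null":
        "SELECT COUNT(*) FROM orders WHERE amount IS NULL",
    "unique: orders.order_id has no duplicates":
        "SELECT COUNT(*) FROM (SELECT order_id FROM orders "
        "GROUP BY order_id HAVING COUNT(*) > 1)",
    "valid: orders.amount is non-negative":
        "SELECT COUNT(*) FROM orders WHERE amount < 0",
    "linked: every order points at a known customer":
        "SELECT COUNT(*) FROM orders o LEFT JOIN customers c "
        "ON o.customer_id = c.customer_id WHERE c.customer_id IS NULL",
}


class DataQualityError(Exception):
    """Raised to stop the pipeline when any rule fails."""


def run_checks(conn):
    """Return {rule: violation_count} for every rule that fails."""
    failures = {}
    for rule, sql in CHECKS.items():
        (bad_rows,) = conn.execute(sql).fetchone()
        if bad_rows:
            failures[rule] = bad_rows
    return failures


def gate(conn):
    """Raise if any check fails, so downstream steps never run on bad data."""
    failures = run_checks(conn)
    if failures:
        raise DataQualityError(failures)


def build_demo_db(orders):
    """Create a throwaway in-memory database with two known customers."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER)")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO customers VALUES (?)", [(1,), (2,)])
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    return conn
```

Calling `gate(build_demo_db([(10, 1, 5.0), (11, 2, 7.5)]))` returns quietly, while a batch containing a duplicate `order_id`, a negative amount, a missing amount or an unknown customer raises `DataQualityError` with the failing rules attached. In an orchestrator the same idea would typically sit as its own task ahead of the steps that publish data.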
