SRE advice: chase fundamentals

A site reliability engineer urged engineers to prioritize fundamentals amid shifting trends, from cloud and DevOps to AI and ML, pointing to core skills that outlast fads. The post frames solid operational basics as more career-durable than chasing every new platform or pattern. (x.com)

A veteran site reliability engineer used a July 2025 post to tell younger engineers to stop chasing every new label and keep building operational basics first. (x.com) The author, who posts as AskYoshik, framed the advice against a familiar sequence of industry waves: cloud, DevOps, artificial intelligence, and machine learning. The point was not that those shifts are fake, but that the underlying work of keeping systems stable, observable, and recoverable still decides who can run production safely. (x.com)

Site reliability engineering is the discipline of running software in production after launch, when failures hit real users and real revenue. Google’s public SRE book puts service level objectives, monitoring, alerting, incident response, postmortems, troubleshooting, and simplicity at the center of that work. (sre.google) Google’s guidance says service level objectives set a target for reliability and help teams decide what work to prioritize. The same chapter says SRE work is driven by defending those objectives over the short, medium, and long term, not by automating “all the things” for its own sake. (sre.google)

That view now sits inside mainstream cloud guidance, not just Google’s books. Amazon Web Services says operational excellence covers how teams organize, observe systems, automate safely, and make frequent, small, reversible changes. (docs.aws.amazon.com) Microsoft’s Azure guidance makes the same case in plainer production terms: failures will happen, distributed systems will break in pieces, and reliable workloads need resilience, recovery, availability, and operations designed in from the start. It tells teams to document clear goals for user experience and measure success against them. (learn.microsoft.com)

The advice also cuts against a habit in engineering hiring cycles, where new tooling categories can look like career shortcuts. Google’s SRE material argues that reliability work is about balancing feature speed against service risk, using error budgets and agreed targets instead of fashion or intuition. (sre.google; sre.google)

That does not mean newer tools are irrelevant. Amazon Web Services explicitly recommends automation and observability, and Microsoft lists observability and recovery planning as part of reliability design; the distinction is that those tools support fundamentals rather than replace them. (docs.aws.amazon.com; learn.microsoft.com)

The through line in the post is older than the latest artificial intelligence cycle: teams still need to know what “good” looks like, how to detect drift, how to roll back change, and how to learn from outages. The labels keep changing, but the production job underneath them has not. (x.com; sre.google)
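
For readers unfamiliar with the error-budget idea cited from Google's SRE material, the arithmetic behind it is simple enough to sketch. The snippet below is illustrative only: the function names, the 30-day window, and the 99.9% target are assumptions chosen for the example, not figures from the post or from any of the cited guidance.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% target over 30 days leaves roughly 43.2 minutes of budget.
    print(round(error_budget_minutes(0.999), 1))
    # After 21.6 minutes of downtime, about half the budget remains.
    print(round(budget_remaining(0.999, 21.6), 2))
```

Under these assumptions, a team that has burned most of its budget has a concrete reason to slow feature releases, while a team with budget to spare can take more risk. That is the sense in which agreed targets, rather than fashion or intuition, steer the trade-off between speed and reliability.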
