Day‑to‑day engineering realities

- Talks and posts from ex‑Big‑Tech staff emphasise that on‑call is noisy, documentation is imperfect, and escalation is normal.
- A CS systems talk and related posts lay out how real incidents reveal system behaviour and team process gaps.
- Interview answers that include observability, runbooks, escalation paths and imperfect realities read as more operationally credible (x.com).

Software breaks in ordinary ways, and the most credible engineers say the ordinary work is noisy alerts, partial dashboards, stale docs, and fast escalation. (usenix.org)

In a 2026 SREcon talk, Atlassian senior site reliability engineer Jack Kingsman described incident work as a sequence of detection, triage, diagnosis, and testing, with responders told to “buy time in reversible ways,” keep notes, and know their observability tools. The slides also warn engineers not to stop at the first plausible explanation and to treat gaps and inconsistencies in telemetry as evidence. (usenix.org)

A 2025 SREcon talk by Rachel Silber made the same point from the review side: effective incident reviews need the right people, the right systems in scope, and a timeline that starts before the incident was formally declared. That framing treats outages as windows into how teams actually work, not just as isolated technical failures. (usenix.org)

That is closer to day-to-day operations than the polished version engineers often give in interviews. Amazon Web Services’ Well-Architected guidance says runbooks should define what starts escalation, who owns each action, and when a human decision must be preapproved so recovery time is not spent “waiting for a response.” (docs.aws.amazon.com)

Atlassian’s incident guidance is even more direct: the first on-call engineer often cannot fix a problem alone, and larger teams need explicit policies for who gets paged next, by severity, duration, and scope. In other words, escalation is not a failure of competence; it is part of the system design. (atlassian.com)

Post-incident accounts from senior engineers show what that looks like under pressure. In a 2023 InfoQ article, Erin Doyle described a Terraform change that triggered customer-impacting data deletion, late alerts, a second incident created during triage, and a three-day recovery involving senior and staff engineers across most of the platform. (infoq.com)

That account also undercut the fantasy that good teams run on perfect documentation and clean handoffs. Doyle said the company had no centralized system for applying Terraform changes, which reduced visibility into what changed, who changed it, and when. (infoq.com)

The practical language that keeps showing up is observability, runbooks, escalation paths, and incident notes. Kingsman’s slides tell responders to know the hierarchy of their dashboards and logs, annotate waits with timestamps, and ask questions that “drive answers, not silence.” (usenix.org)

That is why interview answers that mention noisy on-call rotations, incomplete docs, and knowing when to pull in another team tend to sound more believable than answers built around flawless ownership. The industry’s own incident playbooks, conference talks, and postmortems all describe engineering as coordination under uncertainty, not solo mastery. (docs.aws.amazon.com)
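
To make the idea of an explicit escalation policy concrete, here is a minimal sketch of rules keyed on severity, duration, and scope, the three dimensions the Atlassian guidance names. Everything specific in it is an assumption made for illustration: the severity labels, the minute thresholds, the team counts, and the rotation names are hypothetical and do not come from Atlassian, AWS, or any paging tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EscalationRule:
    """One row of a hypothetical escalation policy."""
    severity: str            # hypothetical label, "sev1" being most urgent
    after_minutes: int       # escalate once the incident has been open this long
    min_teams_affected: int  # or once at least this many teams are affected
    page: str                # hypothetical rotation to page next


# Illustrative thresholds only; a real policy would be agreed on by the teams involved.
RULES = [
    EscalationRule("sev1", after_minutes=0, min_teams_affected=1, page="incident-commander"),
    EscalationRule("sev2", after_minutes=15, min_teams_affected=2, page="secondary-on-call"),
    EscalationRule("sev3", after_minutes=60, min_teams_affected=3, page="team-lead"),
]


def who_to_page_next(severity: str, minutes_open: int, teams_affected: int) -> Optional[str]:
    """Return the rotation to page next, or None if no rule has triggered yet."""
    for rule in RULES:
        if rule.severity != severity:
            continue
        # Either the duration threshold or the scope threshold triggers escalation.
        if minutes_open >= rule.after_minutes or teams_affected >= rule.min_teams_affected:
            return rule.page
    return None


if __name__ == "__main__":
    print(who_to_page_next("sev2", minutes_open=5, teams_affected=1))
    print(who_to_page_next("sev2", minutes_open=20, teams_affected=1))
```

Writing the policy down as data rather than leaving it to judgment in the moment is the point: the first responder does not have to decide whether asking for help is justified, only whether a threshold has been crossed.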
