25 reliability practices shared

Akshay Shinde posted a viral list of 25 reliability practices for engineering teams that names concrete tactics like structured logs, circuit breakers, chaos testing, and an observability culture. The thread has been framed as a practical checklist for improving runbooks and platform reliability across teams. (x.com)

A 25-point reliability checklist from engineer Akshay Shinde is spreading across software teams as a practical guide to keeping services up and incidents shorter. (x.com) The list groups familiar Site Reliability Engineering habits into concrete actions, including structured logs, circuit breakers, chaos testing, runbooks, and an “observability culture.” Google’s Site Reliability Engineering material describes the same backbone: service level objectives, practical alerting, incident response, testing for reliability, and postmortems. (x.com) (sre.google)

Reliability work is the part of software engineering that tries to keep an app usable when code, networks, or dependent services fail. Google’s workbook says service level objectives set a target for reliability, and those targets drive engineering decisions instead of gut feel. (sre.google)

Several items in the thread focus on observability, which means collecting signals that show what a system is doing from the outside. OpenTelemetry defines those signals as logs, metrics, and traces, and IBM describes them as the three pillars teams use to understand system state. (opentelemetry.io) (ibm.com)

Structured logging appears on the checklist because plain-text logs are hard for machines to search at scale. Better Stack says structured logs turn events into machine-parsable records with fields like timestamp, severity, and service name, which makes filtering and correlation faster during incidents. (betterstack.com) (martinfowler.com)

Circuit breakers show up for a different reason: they stop one broken dependency from dragging down everything around it. Microsoft’s Azure Architecture Center says a circuit breaker temporarily blocks calls to a failing service after repeated errors, preventing repeated retries and reducing cascading failures. (learn.microsoft.com)

Chaos testing, another item in the post, means injecting controlled failure into a live-like system to see what breaks before customers do. Google’s Site Reliability Engineering book includes “Testing for Reliability,” and current observability guidance says chaos experiments work best when teams can measure logs, metrics, and traces during the test. (sre.google) (last9.io)

The checklist also leans on process, not just tooling. Google’s postmortem guidance says the goal after an incident is to document what happened, understand contributing causes, and put preventive actions in place rather than assign blame. (sre.google)

That mix of design patterns, telemetry, and operating habits reflects how reliability work has moved from a specialist function to a cross-team discipline. Thoughtworks writes that growing companies often start with reactive fixes, then hit a point where resilience and observability have to become part of architecture, product decisions, and team routines. (martinfowler.com)

The reason a checklist like this travels is simple: most teams do not fail from one missing dashboard or one bad deploy. They fail from small gaps across logging, alerting, rollback, testing, and learning after incidents, the exact gaps the 25-item list tries to name in one place. (x.com) (sre.google)
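For readers who want to see what the structured-logging item can look like in practice, here is a minimal sketch using only Python's standard library. It is not taken from the thread or from any of the cited sources. The `JsonFormatter` class, the `build_logger` helper, the `fields` key, and the `checkout` service name are all assumptions made for illustration. It emits one JSON object per log line with the timestamp, severity, and service name fields the article mentions.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical helper)."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service  # assumed service name, e.g. "checkout"

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "severity": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Optional context passed as extra={"fields": {...}} by the caller.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.error("payment failed", extra={"fields": {"order_id": "A-1001"}})
```

Because every record carries the same keys, a log pipeline can filter on `severity` or `service` directly instead of pattern-matching free text.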
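The circuit breaker behavior described above (block calls to a failing dependency after repeated errors, then try again later) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Microsoft's reference implementation and not code from the thread. The `CircuitBreaker` and `CircuitOpenError` names, the threshold of three failures, and the 30-second reset timeout are hypothetical choices.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    """Simplified breaker: closed -> open after N failures -> half-open after a timeout."""

    def __init__(
        self,
        failure_threshold: int = 3,      # assumed value for illustration
        reset_timeout: float = 30.0,     # seconds; assumed value
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # allow a trial call through
        return "open"

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "open":
            # Fail fast instead of hammering the broken dependency.
            raise CircuitOpenError("dependency marked unhealthy; call skipped")
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                # Trip the breaker, or re-trip it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Any success resets the breaker to closed.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(lambda: fetch_inventory())`, where `fetch_inventory` stands in for any hypothetical dependency call. Injecting the clock is a design choice that makes the timeout logic testable without real waiting.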

