Dev troubleshooting checklist

A DevSecOps engineer recommended a compact troubleshooting checklist that prioritizes logs → metrics → traces, recent changes, and runbooks — a practical sequence for finding failures fast. (x.com) The post has modest social traction (2 likes, 177 views), but the checklist is the kind of small, repeatable practice teams can fold into incident playbooks to speed root-cause analysis. (x.com)

A small post from a DevSecOps engineer laid out a troubleshooting order that many teams learn the hard way: start with logs, then check metrics, then follow traces, then ask what changed, then open the runbook. The post itself barely traveled. The idea should.

That sequence works because incidents rarely fail in just one way. A broken service leaves behind several kinds of evidence, and each one answers a different question. Logs give the closest thing to a diary. They record events in order and often capture the exact error, the component that threw it, and the moment it happened. Kubernetes describes logs as a chronological record of events inside applications and system components, which is why operators still reach for them first when something looks wrong. (kubernetes.io)

But logs alone can drown an engineer in detail. Metrics shrink the scene to a few hard numbers. They show whether error rate jumped, latency stretched, or request volume collapsed. Google’s SRE workbook treats metrics and structured logging as the core data sources for judging service health and diagnosing failures, and AWS makes the same distinction in plainer language: monitoring tells you that something is wrong, while observability helps explain what and why. That is the real value of moving from logs to metrics instead of treating them as rivals. One gives texture. The other gives shape. (sre.google)

The next step matters because modern outages rarely stay inside one process. Traces follow a request across services, queues, and databases. OpenTelemetry calls distributed tracing essential for systems with behavior that is hard to reproduce locally, especially when failures are nondeterministic. Google Cloud’s root-cause walkthrough shows the practical use: an engineer sees an error-rate spike in a dashboard, opens a trace tied to the bad requests, and narrows the problem to a failing call inside one service. The trace does not finish the job, but it cuts the search space fast. (opentelemetry.io)

That is where the checklist turns from observability theory into incident craft. After traces, it asks the question that solves an absurd number of production problems: what changed? Google’s example makes the logic explicit. If errors line up with a recent service update, rollback may be the fastest mitigation. If they do not, keep digging. In the same example, the trace-linked logs point to repeated database connection failures, and the eventual culprit is a recent configuration change. The checklist is compact because production failure is often compact too. A bad deploy, a flipped flag, a changed secret, a new dependency setting. (cloud.google.com)

The last step, the runbook, is what turns one person’s memory into a team habit. AWS’s incident tooling has customers associate alarms with a runbook during incident management, which is a quiet acknowledgment that diagnosis is not enough. Teams need a prepared path from signal to action. Runbooks do not replace judgment. They save it for the parts that actually require thought. (docs.aws.amazon.com)

The larger observability world usually talks about “three pillars” and sprawling telemetry stacks. Kubernetes uses that exact language for metrics, logs, and traces. OpenTelemetry says a system is properly instrumented when developers do not need to add more instrumentation just to troubleshoot an issue. The engineer’s post cuts through the abstraction. In an outage, nobody needs a philosophy lesson. They need an order of operations: logs, metrics, traces, recent changes, runbook. (kubernetes.io)
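
The "what changed?" step described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the post or from Google's walkthrough: the `Change` record, the `changes_before` helper, the 30-minute window, and the sample events are all hypothetical. It assumes a team can already export its deploys, flag flips, and config edits with timestamps.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """Hypothetical record of one production change (deploy, flag, config, secret)."""
    kind: str
    description: str
    at: datetime


def changes_before(spike_start, changes, window=timedelta(minutes=30)):
    """Return changes made within `window` before the spike, most recent first."""
    earliest = spike_start - window
    suspects = [c for c in changes if earliest <= c.at <= spike_start]
    return sorted(suspects, key=lambda c: c.at, reverse=True)


# Illustrative data only; none of these events come from the article's sources.
spike = datetime(2024, 1, 1, 12, 0)
recent = [
    Change("deploy", "checkout service update", datetime(2024, 1, 1, 9, 15)),
    Change("config", "database connection setting", datetime(2024, 1, 1, 11, 48)),
    Change("flag", "new pricing flag enabled", datetime(2024, 1, 1, 11, 40)),
]

for change in changes_before(spike, recent):
    minutes = int((spike - change.at).total_seconds() // 60)
    print(f"{change.kind}: {change.description} ({minutes} min before spike)")
```

An empty result would not clear recent changes as a cause. It would only mean nothing landed inside the chosen window, which is the cue to widen it or keep digging.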
