SRE rulebook: SLO math

Practical SRE guidance is trending: a 99.5% SLO over a 30-day month leaves a 3.6-hour error budget (0.5% of 720 hours), and the heuristic shared was blunt: ship if more than 50% of your error budget remains, fix urgently if less than 10% remains (x.com). Those concrete thresholds are being pushed as tools to make release vs. rollback debates less emotional and more measurable (x.com).
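The arithmetic behind the 3.6-hour figure, and one way to turn the two thresholds into a gate, can be sketched as follows. This is a minimal illustration, not a standard API: the function names are hypothetical, the thresholds are read as fractions of budget remaining (an assumption), and the "proceed with caution" band between 10% and 50% is an assumption because the shared heuristic does not say what to do there.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes in the window: (1 - SLO) x window minutes."""
    return (1.0 - slo) * window_days * 24 * 60


def release_gate(budget_remaining: float) -> str:
    """Map the fraction of error budget still remaining (0.0 to 1.0) to an action.

    Thresholds follow the shared heuristic: more than 50% left means ship,
    less than 10% left means fix urgently. The middle band is an assumption.
    """
    if budget_remaining > 0.50:
        return "ship"
    if budget_remaining < 0.10:
        return "fix urgently"
    return "proceed with caution"  # assumed label for the unspecified band


if __name__ == "__main__":
    budget = error_budget_minutes(0.995)       # 99.5% SLO, 30-day window
    print(round(budget / 60, 2), "hours")      # budget expressed in hours
    print(release_gate(0.65), "|", release_gate(0.05))
```

For a 99.5% SLO the budget works out to 216 minutes, which is the 3.6 hours quoted above.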

Most production SRE teams compute error budgets on a rolling 30-day window and derive the budget as (1 − SLO) × window minutes, a method described in Google's SRE workbook and reinforced in example SLO docs. (sre.google) Operational playbooks pair that with multi-window burn-rate alerting, using separate short and long lookbacks; Google Cloud's monitoring docs normalize burn rate so that a value above 1 means the current pace, if sustained, will exhaust the error budget before the window ends. (docs.cloud.google.com) SRE tooling uses burn-rate multipliers to translate short-window error rates into percent-of-budget alerts; the common 14.4× multiplier flags a condition that would consume roughly 2% of a 30-day error budget in one hour (14.4 × 1 h / 720 h). (slo.foo)

Implementation guides convert percent-of-budget bands into discrete actions, labeled variously as "normal shipping," "restricted shipping," and "halt and fix," and vendor and consulting posts publish sample threshold mappings to make those gates actionable. (stew.so) Google's example error-budget policy requires a postmortem if a single incident consumes more than 20% of a four-week budget, and it halts non-critical releases once the budget for the window is spent, until the service is back within SLO. (sre.google) Observability and CI/CD vendors including Datadog and New Relic, plus open calculators such as Project Helena, provide automated error-budget tracking and burn-rate alerting so teams can enforce those policy gates inside release pipelines. (datadoghq.com)
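The burn-rate arithmetic and the 20% postmortem trigger described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not any vendor's API: all function names are hypothetical, the alert rule (page only when both the long and the short lookback exceed the same multiplier) is one common reading of multi-window alerting, and the sample error rates are made up for the example.

```python
WINDOW_HOURS = 30 * 24  # rolling 30-day window, as in the text


def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate divided by the allowed error rate (1 - SLO).

    A value of 1 spends the budget exactly over the full window;
    anything above 1 spends it early.
    """
    return error_rate / (1.0 - slo)


def budget_fraction_consumed(rate: float, lookback_hours: float,
                             window_hours: float = WINDOW_HOURS) -> float:
    """Fraction of the window's budget spent if `rate` holds for the lookback."""
    return rate * lookback_hours / window_hours


def should_page(long_rate: float, short_rate: float,
                multiplier: float = 14.4) -> bool:
    """Assumed multi-window rule: both lookbacks must exceed the multiplier,
    so a spike that has already ended does not keep paging."""
    return long_rate >= multiplier and short_rate >= multiplier


def needs_postmortem(incident_bad_minutes: float, budget_minutes: float,
                     threshold: float = 0.20) -> bool:
    """True if one incident used more than `threshold` of the budget."""
    return incident_bad_minutes / budget_minutes > threshold


if __name__ == "__main__":
    slo = 0.995
    long_rate = burn_rate(0.08, slo)    # hypothetical 1-hour error rate of 8%
    short_rate = burn_rate(0.09, slo)   # hypothetical short-lookback rate of 9%
    print(should_page(long_rate, short_rate))
    print(budget_fraction_consumed(14.4, lookback_hours=1))
```

With a 30-day window, `budget_fraction_consumed(14.4, 1)` is 14.4 × 1 / 720, which is the roughly 2% per hour that the 14.4× multiplier is meant to catch. For the postmortem check, a 99.5% SLO over four weeks gives a budget of 201.6 minutes, so a single incident longer than about 40 minutes of full outage would cross the 20% line.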
