Practical capacity baselines

A recent guide recommends measuring basic system baselines (requests per second, latency, CPU, and memory), then using load tests to find bottlenecks and setting alerts at around 70% utilization to avoid reactive firefighting. The checklist treats capacity planning as a matter of observable metrics rather than guesswork, so teams can make the trade-off between throughput and headroom visible. (x.com)
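The four baseline numbers above can be pulled out of a single load-test run in a few lines. The sketch below is a minimal Python version. It assumes the load-test tool exports one latency value per completed request plus periodic CPU and memory utilization samples. The function names, the input shapes, and the choice of p95 as the headline latency figure are illustrative assumptions, not something the guide specifies. Throughput follows the definition Azure Load Testing uses: total requests divided by total test time.

```python
import math
from typing import Dict, Sequence


def requests_per_second(total_requests: int, duration_s: float) -> float:
    # Throughput = total requests / total test time (seconds).
    if duration_s <= 0:
        raise ValueError("test duration must be positive")
    return total_requests / duration_s


def percentile(values: Sequence[float], pct: float) -> float:
    # Nearest-rank percentile: the smallest sample with at least pct% of
    # all samples at or below it.
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def baseline(latencies_ms: Sequence[float], duration_s: float,
             cpu_pct: Sequence[float], mem_pct: Sequence[float]) -> Dict[str, float]:
    # Assumed (hypothetical) export format: one latency sample per completed
    # request, so the request count is simply the number of samples.
    return {
        "rps": requests_per_second(len(latencies_ms), duration_s),
        "p95_latency_ms": percentile(latencies_ms, 95),
        "peak_cpu_pct": max(cpu_pct),
        "peak_mem_pct": max(mem_pct),
    }


if __name__ == "__main__":
    run = baseline(
        latencies_ms=[120, 80, 95, 300, 110, 90, 100, 105, 85, 250],
        duration_s=5.0,
        cpu_pct=[40, 55, 62],
        mem_pct=[50, 58, 61],
    )
    print(run)  # prints {'rps': 2.0, 'p95_latency_ms': 300, 'peak_cpu_pct': 62, 'peak_mem_pct': 61}
```

Reporting a tail percentile instead of the mean keeps a few slow requests from being averaged away, which matters once load starts to push latency up.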

Capacity planning starts with four numbers: requests per second, latency, central processing unit (CPU) use, and memory use. Google’s Site Reliability Engineering guide says services should be measured on latency and traffic before teams try to fix overload. (sre.google) Microsoft’s Azure Well-Architected guide defines capacity planning as matching compute, memory, storage, and network resources to performance targets before traffic changes arrive. Its performance-testing guide says teams should set acceptance criteria, such as maximum requests per second and CPU limits, and then test against them. (learn.microsoft.com 1) (learn.microsoft.com 2)

Load testing is the step where those baselines stop being estimates. Microsoft’s Azure Load Testing documentation defines requests per second as total requests divided by total test time, and says test runs should track response time alongside throughput as virtual users ramp up. (learn.microsoft.com 1) (learn.microsoft.com 2)

The trade-off is simple: more throughput usually pushes up latency or resource use somewhere else. Google Cloud’s Bigtable performance guide tells operators to choose whether a workload is optimizing for low latency or for high throughput, rather than assume both can rise together without added capacity. (cloud.google.com)

That is why bottlenecks matter more than averages. Google’s Site Reliability Workbook treats overload as a practical design problem, and its data-pipeline chapter lists CPU saturation, performance regressions, and out-of-memory failures among the concrete ways a single weak link can cap the whole system. (sre.google 1) (sre.google 2)

The “70 percent” rule is not a law of physics, but it is a common headroom marker in cloud guidance. Google Cloud’s Bigtable monitoring guide recommends alerts for storage utilization above 70 percent, and a recent Kubernetes autoscaling guide shows how a 70 percent CPU target doubles replicas when measured load reaches 140 percent of requested capacity, because the autoscaler scales by the ratio of measured to target utilization (140 / 70 = 2). (cloud.google.com) (alexandre-vazquez.com) Amazon Web Services uses a lower target in one common example: its Auto Scaling documentation gives 50 percent average CPU utilization as a target for scaling out early. That reflects the same idea: leave room before customers feel the slowdown. (aws.amazon.com)

Memory belongs in the baseline because it fails differently from CPU: a saturated CPU slows requests down, while exhausted memory tends to end in out-of-memory failures. Amazon Elastic Container Service documentation says task sizing depends directly on CPU units and mebibytes of memory, while the Amazon Aurora Serverless v2 documentation says most scaling events are driven by memory usage and CPU activity. (docs.aws.amazon.com 1) (docs.aws.amazon.com 2)

The operating model behind all of this is less about prediction than visibility: measure the baseline, push the system with load, find the first resource that bends, and set alerts before that resource is full. (learn.microsoft.com) (sre.google)
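The 70 percent CPU example can be checked against the scaling rule the Kubernetes Horizontal Pod Autoscaler documents: desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), with utilization measured against the pods' CPU requests. The sketch below is a minimal Python model of that arithmetic plus a simple threshold alert. The function names and the 70 percent alert default are illustrative assumptions. The 0.1 tolerance mirrors the controller's default. The sketch leaves out minReplicas/maxReplicas bounds, stabilization windows, and pod-readiness handling, all of which the real controller applies.

```python
import math


def desired_replicas(current_replicas: int, current_util_pct: float,
                     target_util_pct: float, tolerance: float = 0.1) -> int:
    """Simplified HPA rule: ceil(current_replicas * current_util / target_util).

    Utilization is relative to the pods' CPU requests, so it can exceed 100%.
    """
    if current_replicas < 1 or target_util_pct <= 0:
        raise ValueError("need at least one replica and a positive target")
    ratio = current_util_pct / target_util_pct
    # Like the HPA controller, skip scaling when the ratio is close to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def needs_headroom_alert(util_pct: float, threshold_pct: float = 70.0) -> bool:
    # Hypothetical alert rule: fire before the resource is full, not after.
    return util_pct >= threshold_pct


if __name__ == "__main__":
    print(desired_replicas(4, 140, 70))  # prints 8
    print(desired_replicas(4, 73, 70))   # prints 4 (within tolerance)
    print(needs_headroom_alert(72))      # prints True
```

With 4 replicas at 140 percent of requested CPU against a 70 percent target, the ratio is 2, so the sketch returns 8. That matches the doubling described above. The same function also scales down: 35 percent against a 70 percent target halves 4 replicas to 2.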
