Stability Sim for outages
AWS Distinguished Engineer Marc Brooker released 'Stability Sim', an interactive tool for exploring distributed-system outage pitfalls like cache failures, retry storms, and load-balancer behavior. The tool is framed as a hands-on way to model failure modes engineers face when designing resilient services. (x.com)
A distributed system is a service split across many machines, and the hard part is that small failures can multiply as traffic bounces between them. On April 15, Marc Brooker released Stability Sim, a browser tool for testing those failure chains interactively. (stability-sim.systems, aws.amazon.com)
The site lets users drag in clients, load balancers, servers, caches, databases, and queues, then add timed failure scenarios and watch latency percentiles and throughput change on a dashboard. The live app also includes save and load controls, speed controls, random seeds, and example setups. (stability-sim.systems)
Brooker is a vice president and Distinguished Engineer at Amazon Web Services, where the company says he has spent 16 years working on Elastic Compute Cloud, Elastic Block Store, Lambda, and Aurora DSQL. His public writing has focused for years on outages, retries, caches, and other failure behavior in large services. (aws.amazon.com, brooker.co.za)
In plain terms, the simulator tackles a common cloud problem: a system that looks fine under normal traffic can fall into a feedback loop once one part slows down. Brooker wrote in 2021 that higher latency raises concurrency, and higher concurrency can raise latency again until goodput drops. (brooker.co.za)
Caches are one example. They speed up reads when warm, but an empty cache can suddenly shove work back onto a database that was never sized for that spike. Brooker wrote that a cache-heavy system can settle into a “happy loop” when the cache is full or a “sad loop” when the cache is empty and stays empty. (brooker.co.za)
Retries are another example. Clients often send the same request again after a timeout, which adds traffic to a service that may already be overloaded. Brooker wrote in 2022 that at a 100 percent failure rate, a policy of N retries makes the system do 1+N times as much work. (brooker.co.za)
Load balancers and circuit breakers can also help or hurt depending on how they shift traffic and failures between components. Brooker wrote that client-side circuit breakers may make partial outages worse, and that engineers often need simulation because closed-form reasoning about these interactions is difficult. (brooker.co.za, brooker.co.za)
Brooker has argued for small simulations as a practical design tool, separate from formal verification systems such as TLA+, because they let engineers explore system dynamics before code reaches production. He published a small simulator example on GitHub and wrote in 2022 that even basic numerical methods can surface surprising behavior. (github.com, brooker.co.za)
That makes Stability Sim less like a benchmark and more like a sandbox for outage drills: change the topology, inject a failure, and watch whether the service recovers or spirals. The release turns several years of Brooker’s public writing on metastability, retries, and overload into a hands-on model engineers can run in a browser. (stability-sim.systems, brooker.co.za, brooker.co.za)
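
The retry arithmetic cited above can be checked with a few lines of Python. This is only an illustrative sketch, not code from Stability Sim or from Brooker's posts: the function name `expected_attempts` is hypothetical, and it assumes every attempt fails independently with the same probability and that clients retry immediately with no backoff or retry budget.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected number of attempts a client makes per request.

    Assumptions (illustrative only): each attempt fails independently
    with probability `failure_rate`, and the client retries up to
    `max_retries` times after the first attempt.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    # Attempt k (counting from 0) is only made if the k earlier attempts
    # all failed, which happens with probability failure_rate ** k.
    return sum(failure_rate ** k for k in range(max_retries + 1))


if __name__ == "__main__":
    retries = 3
    for rate in (0.0, 0.1, 0.5, 1.0):
        work = expected_attempts(rate, retries)
        print(f"failure rate {rate:.0%}: {work:.3f}x work with {retries} retries")
```

Under these assumptions, a healthy service sees almost no extra load from retries, while a fully failing one sees the full 1+N multiplier, which is the worst case the article describes.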