AWS thermal event knocks out US‑East services
- Amazon attributed a recent multi‑hour disruption in US‑EAST‑1 to a 'thermal event' that caused a power loss at a Northern Virginia data center. (mashable.com)
- The outage cascaded: Coinbase reported a related seven‑hour service interruption and CEO Brian Armstrong called the downtime 'never acceptable.' (crowdfundinsider.com)
- Consumer shopping, platform services, and enterprise workloads saw outages as recovery stretched into a second day while Amazon restored capacity. (mashable.com) (www.el-balad.com)

AWS just gave the plainest explanation yet for this week’s US‑East disruption: a thermal event in one Northern Virginia data center cut power inside a single Availability Zone, use1‑az4, and that knocked over EC2 instances and degraded EBS volumes starting May 7 at 4:20 p.m. PDT. AWS says it shifted traffic away from the zone by 5:06 p.m., but recovery dragged on because some servers had shut themselves down once temperatures crossed safe operating limits. (health.aws.amazon.com)

### What actually failed?

The important detail is that this was not framed as a whole‑region collapse. AWS described it as a single facility inside a single Availability Zone in us‑east‑1. That matters because AWS’s pitch has always been that an app spread across zones should survive one zone going bad. When one building gets hot, loses power, and starts shedding compute and storage, the real test shifts from AWS’s hardware to customers’ architecture. (health.aws.amazon.com)

### Why does “thermal event” matter?

Because that phrase is AWS shorthand for a physical infrastructure problem, not a software bug. Turns out the failure chain here was brutally old‑school: temperatures rose, power was lost, and servers protected themselves by shutting down. Once that happens, recovery is slower than a normal control‑plane hiccup. You are not just rerouting requests. You are bringing back machines, checking storage health, and dealing with whatever state got stranded mid‑flight. (status.aws.amazon.com)

### Why did one zone hit so many companies?

US‑East‑1 is the gravity well of the public cloud. It is AWS’s oldest, busiest region, and a lot of companies still anchor critical services there because it is where the most products, capacity, and historical deployments tend to live. The catch is concentration.
Even if a company thinks it is “in the cloud,” a surprising amount of that cloud can still sit in one metro area, one region, or even one zone if the design is sloppy or capacity constraints nudged workloads into a narrow footprint. AWS’s own status language points directly at zonal impact, not regional impact. (health.aws.amazon.com)

### How bad was the Coinbase spillover?

Coinbase’s public status page tied degraded performance directly to the AWS outage on May 7. Its incident log shows users unable to transact on web and mobile, followed by a later update saying Coinbase experienced service disruptions due to increased temperatures in the affected AWS service. The disruption window on the status feed ran from 5:56 p.m. PDT to 12:46 a.m. PDT the next day: about 6 hours and 50 minutes, or roughly the “seven hours” people are quoting. (status.coinbase.com)

### So was AWS “down”?

Not in the simple all‑or‑nothing sense. A lot of internet outage talk flattens everything into “AWS went down,” but AWS’s public updates describe a narrower event with broad downstream consequences. Some customers would have seen full outages. Others would have seen slow recovery, impaired volumes, or partial failures in specific services. That distinction matters because it tells you where resilience broke: inside AWS’s facility, inside the customer’s zonal failover plan, or both. (health.aws.amazon.com)

### Why did recovery last into the next day?

Storage is the usual reason these incidents feel longer than the initial blast radius suggests. Compute can often be restarted or moved faster than attached state can be verified and restored. AWS kept saying it was dealing with impaired EC2 instances and degraded EBS volumes, which is a clue that the hard part was not just traffic steering but getting persistent infrastructure back into a safe, consistent state. (health.aws.amazon.com)

### What should companies take from this?

The boring lesson is still the real one: multi‑AZ has to be real, tested, and automatic.
Not a diagram. Not a checkbox. AWS is even pushing affected customers toward disaster recovery plans, remote backups in other regions, and traffic redirection away from impacted regions. That language reads like a reminder that cloud resilience is shared work: AWS keeps the platform running, but customers decide whether one hot building becomes a bad afternoon or a front‑page outage. (health.aws.amazon.com)

Bottom line: this was a physical failure in a single AWS facility, but it exposed a much bigger digital truth. Cloud outages rarely stay “local” when too many important systems are stacked on the same piece of infrastructure. (health.aws.amazon.com)
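
For teams that want to turn the concentration point into something checkable, here is a minimal sketch in Python of a zone-spread audit. Everything in it is an assumption for illustration: the inventory, the service names, and the 50% threshold are hypothetical, and a real audit would pull placements from your own asset database or cloud tooling. Only the zone ID use1‑az4 comes from AWS’s incident description; the other zone IDs are placeholders.

```python
from collections import Counter

# Hypothetical inventory: (service name, Availability Zone ID) pairs.
# The services and placements are made up for illustration only.
INVENTORY = [
    ("checkout", "use1-az4"),
    ("checkout", "use1-az4"),
    ("checkout", "use1-az1"),
    ("search", "use1-az1"),
    ("search", "use1-az2"),
    ("ledger", "use1-az4"),
]


def zone_concentration(inventory, threshold=0.5):
    """Return {service: (zone, share)} for every service that keeps more
    than `threshold` of its instances in a single zone."""
    per_service = {}
    for service, zone in inventory:
        per_service.setdefault(service, Counter())[zone] += 1

    risky = {}
    for service, zones in per_service.items():
        # The most heavily used zone for this service and its share.
        zone, count = zones.most_common(1)[0]
        share = count / sum(zones.values())
        if share > threshold:
            risky[service] = (zone, round(share, 2))
    return risky


if __name__ == "__main__":
    for service, (zone, share) in zone_concentration(INVENTORY).items():
        print(f"{service}: {share:.0%} of instances in {zone}")
```

In this made-up inventory, `checkout` and `ledger` are flagged because most or all of their instances sit in one zone, while `search` is split evenly and passes. A check like this says nothing about whether failover actually works; it only shows where a single hot building would hurt most.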