AWS says thermal event cut power at Northern Virginia data center, triggering outage

- AWS said a May 7 thermal event in a Northern Virginia data center cut power in Availability Zone use1-az4, disrupting EC2, EBS, Coinbase, and FanDuel.
- The trouble started at 5:25 p.m. PDT, and AWS warned Friday that full recovery would still take hours as added cooling came online.
- It matters because us-east-1 is AWS’s oldest, busiest region, so a single-zone failure can still ripple far beyond one building.

Cloud outages usually sound abstract: some status page turns orange, some apps get flaky, people grumble online. But this one started with something very physical: too much heat in a Northern Virginia AWS data center, then a loss of power, then a long recovery in one of the internet’s most important cloud regions. That chain matters because AWS is supposed to turn building-level problems into software-level inconveniences. This time, a single Availability Zone problem still spilled outward into trading, betting, and other customer systems on May 7 and May 8.

### What actually broke?

AWS said temperatures rose inside a single data center in the us-east-1 region, specifically Availability Zone use1-az4. That thermal event led to power loss on some hardware, which then impaired EC2 instances and EBS volumes hosted there. In plain English, servers and storage inside the affected slice of the region stopped behaving normally at the same time. (cnbc.com)

### Why did recovery take so long?

Because this was not just a software rollback. AWS said it had to restore cooling capacity first, then recover affected racks in a controlled and safe way. That is the catch with heat and power incidents: you cannot just slam everything back on at once like rebooting a laptop. If hardware got too hot or lost power unevenly, bringing it back is part facilities work, part infrastructure triage. AWS was still warning on Friday that full recovery would take several more hours and that progress was slower than expected. (datacenterdynamics.com)

### Why did customers feel it so directly?

Because a lot of companies still anchor critical workloads in us-east-1, AWS’s oldest and biggest region. AWS has multiple Availability Zones there, and the whole design idea is that one zone can fail without taking down an application. But that only works if the customer actually built for zone failure.
If databases, queues, caches, or stateful services lean too hard on one zone, one building problem becomes a customer outage. (cnbc.com)

### Which companies were hit?

Coinbase and FanDuel were two of the most visible examples. FanDuel told users it was dealing with technical difficulties and later tied the issue to the broader AWS outage. Coinbase said failures in multiple AWS zones caused an extended outage of core trading services, though it later said the primary issue was resolved. That wording is interesting: AWS publicly described a single-zone incident, but customer architectures can make one-zone damage show up in messier ways upstream. That is not a contradiction so much as a map of dependencies. (datacenterdynamics.com)

### Was this the whole region going down?

No, and that distinction matters. AWS framed this as a single Availability Zone event inside us-east-1, not a region-wide collapse. But users do not experience outages in AWS vocabulary. They experience them in app vocabulary: can I place a trade, can I cash out, can I log in, can I reach storage? A “limited” cloud incident can still feel total if the app in front of you has no graceful fallback. (cnbc.com)

### Why does Northern Virginia keep coming up?

Because Northern Virginia is one of the densest cloud and internet infrastructure hubs on earth, and us-east-1 is deeply embedded in how companies deploy. Datacenter operators and AWS customers keep coming back to the same tradeoff: the region is powerful, mature, and convenient, but concentration creates blast radius. DatacenterDynamics notes this region has seen major incidents before, including in October 2025 and earlier years. (cnbc.com)

### So what is the lesson here?

The lesson is not “cloud is fragile.” The lesson is that resilience is something customers have to buy with design choices.
Multi-AZ failover, region-agnostic queues, degraded read-only modes, and fewer hard dependencies on one hot spot all cost more and take longer to build. But they are what turn a thermal event in one building into a non-event for users. (datacenterdynamics.com)

### Bottom line?

AWS traced this outage to heat, cooling, and power in a single Northern Virginia facility. That sounds mundane, but it is the whole story: the cloud is still made of buildings, and buildings still fail in very old-fashioned ways. (cnbc.com)

