HCP Vault Stays Available During AWS Regional Outage
HashiCorp's HCP Vault service reportedly maintained data plane availability during a recent AWS regional outage. The resilience is attributed to an architecture that separates the control and data planes. This design allows critical functions to continue operating even when the cloud provider's management console or APIs are unavailable.
- The event that validated HCP Vault's resilience was a significant AWS us-east-1 regional outage on October 20, 2025. The outage caused elevated error rates and intermittent panics in the HCP Vault control plane, which is hosted in that region.
- Despite the control plane impact, HCP Vault Dedicated customer clusters maintained 100% data plane uptime across all regions and cloud providers, including clusters running in us-east-1. This met the service's 99.99% SLA.
- The core architectural principle behind this resilience is the separation of the control plane, which handles administrative functions like cluster creation and configuration, from the data plane, which manages secrets and API calls.
- While the data plane remained fully operational, the outage caused transient issues for some control plane administrative workflows, such as creating new snapshots, fetching audit logs, or adding new backup regions.
- The separation lets the data plane have fewer moving parts and less complexity than the control plane, which often involves intricate workflows and databases. Fewer components mean fewer points of failure, which lowers the likelihood of a failure event in the data plane.
- The us-east-1 region has a history of significant outages, including events in 2011, 2017, 2020, and 2021 that impacted major services like EC2, S3, and Kinesis. The October 2025 outage was attributed to DNS system issues within the region.
- HashiCorp's design for high availability includes 3-node clusters for production tiers and the ability to configure cross-region disaster recovery, allowing for failover to a backup network in the event of a regional outage.
- The separation of planes is a recognized best practice for building resilient, scalable, and secure distributed systems, allowing each plane to scale and evolve independently.
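
Two of the points above lend themselves to quick arithmetic: what a 99.99% availability target permits, and why a request path with fewer components tends to fail less often. The Python sketch below is illustrative only. The function names, the yearly measurement period, the component counts, and the per-component availability figures are all assumptions made for the example, not values published by HashiCorp or AWS, and the model assumes component failures are independent.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(sla: float, period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Maximum downtime an availability target allows over the period."""
    return (1.0 - sla) * period_minutes


def serial_availability(components: list[float]) -> float:
    """Availability of a path that needs every component up.

    Assumes failures are independent, so the availabilities multiply.
    """
    return prod(components)


# A 99.99% target leaves about 52.56 minutes of downtime per year.
budget = downtime_budget_minutes(0.9999)

# Hypothetical paths: the data plane depends on 3 components,
# the control plane on 8, each assumed to be 99.99% available.
data_plane = serial_availability([0.9999] * 3)
control_plane = serial_availability([0.9999] * 8)

print(f"Yearly downtime budget at 99.99%: {budget:.2f} minutes")
print(f"Data plane path availability:    {data_plane:.6f}")
print(f"Control plane path availability: {control_plane:.6f}")
```

Under these assumed numbers the shorter data plane path comes out more available than the longer control plane path, which is the intuition behind keeping the data plane simple. Real systems have correlated failures and redundancy, so this serial model is a rough guide rather than a prediction.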