Stripe API Outage Causes Downstream Service Delays
An API reliability incident at Stripe on February 15 caused multi-hour email delivery delays and dashboard errors for its customers, including email service Resend. While no data was lost, the event highlighted the cascading impact of infrastructure downtime on developer-facing products. The incident serves as a reminder that reliability is a critical feature for API-first companies, and transparent postmortems are key to maintaining developer trust.
- A notable precedent for the February 15th outage occurred in March 2022, when Stripe experienced a three-hour API latency surge. During that incident, median latency for crucial endpoints increased from 120 milliseconds to over 3 seconds, causing timeouts and user frustration even though the system was technically still operational. The root cause was an unbalanced connection pool that saturated a database cluster, a situation worsened by a "retry storm" from cascading client requests.
- The March 2022 event was categorized as a "gray failure," where traditional uptime metrics appeared normal, but the user experience was significantly degraded. This highlighted a key lesson for infrastructure resilience: monitoring for latency is as critical as monitoring for outright failures, as slow performance can be just as disruptive as a complete outage.
- Stripe's commitment to reliability is also evident in its architectural choices, specifically its use of date-based API versioning. This strategy ensures that code written against their API in 2011 would still function correctly today, preventing the breaking changes that often accompany traditional v1/v2 versioning and fostering long-term developer trust.
- On February 4, 2026, Stripe's API services were degraded twice, first from 16:36 to 17:02 UTC and again from 21:14 to 22:47 UTC, leading to elevated error rates and response times. These earlier incidents in the same month also caused delays for some GBP bank account payouts, which were later resolved.
- In response to a 24-minute API degradation in early February 2026, Stripe's CEO, Patrick Collison, personally commented on a Hacker News thread, stating, "We work hard to maintain extreme reliability in our infrastructure, with a lot of redundancy at different levels... We'll be conducting a thorough investigation and root-cause analysis."
- The developer community's reaction to the early February outage on Hacker News was largely one of understanding, citing Stripe's strong track record on reliability. One user commented, "This is causing a big problem for my business right now, but I am not mad at Stripe because you earned that level of credibility and respect in my opinion."
- Prior to the February 15th outage, Stripe had announced a planned trust chain update for its TLS certificates, scheduled to begin on February 16, 2026. This required developers who perform certificate pinning to update their configurations to avoid disruptions.
- Stripe's engineering culture emphasizes a proactive approach to preventing outages, with one of its core tenets being to "practice your worst days every day." This involves intentionally pushing systems to their breaking points to understand and mitigate potential failure modes.
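
The "gray failure" lesson from the March 2022 incident, that latency deserves the same monitoring attention as outright errors, can be illustrated with a small sketch. This is not Stripe's monitoring code: the function name `classify_health`, the 500 ms latency budget, and the 1% error-rate limit are all hypothetical values chosen for illustration. Only the 120 ms and roughly 3 second figures come from the incident described above.

```python
from statistics import median


def classify_health(samples, latency_budget_ms=500.0, max_error_rate=0.01):
    """Classify one monitoring window of request samples.

    samples: list of (latency_ms, succeeded) tuples.
    latency_budget_ms and max_error_rate are illustrative thresholds,
    not values taken from any real service.
    """
    if not samples:
        return "unknown"

    # Traditional availability signal: fraction of failed requests.
    error_rate = sum(1 for _, ok in samples if not ok) / len(samples)
    if error_rate > max_error_rate:
        return "failing"

    # Latency signal: requests succeed, but are they fast enough?
    p50 = median(latency for latency, _ in samples)
    if p50 > latency_budget_ms:
        return "degraded"  # gray failure: "up" by error rate, slow for users

    return "healthy"


# A window at the normal ~120 ms median versus one at ~3 s, with no errors.
normal_window = [(120.0, True)] * 100
slow_window = [(3200.0, True)] * 100

print(classify_health(normal_window))
print(classify_health(slow_window))
```

A check based on error rate alone would report both windows as fine. Adding the median-latency condition is what separates the slow window from the normal one. A production system would more likely track tail percentiles (p95, p99) over sliding windows, but the principle is the same.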