Microsoft 365 Services Experience Degradations
Microsoft 365 (M365) services are experiencing multiple degradations, according to recent status reports. The issues include outages affecting Teams bots and instability in Outlook, particularly in the Asia-Pacific region. In contrast, GitHub has reported 100% uptime over the same period, with only minor, quickly resolved incidents.
- A recent Microsoft Teams outage was attributed to a deployment that contained a broken connection to an internal storage service, which had downstream impacts on other integrated services like Word, Office Online, and SharePoint Online.
- The instability in the Asia-Pacific region was caused by a routing fault within Microsoft's infrastructure that handled traffic for Japan, leading to login failures and degraded app performance for several hours. Microsoft engineers resolved the issue by rebalancing traffic across backup systems.
- Outages of AI-assisted development tools like Microsoft Copilot have been linked to unexpected traffic spikes overwhelming the service's autoscaling infrastructure, requiring manual scaling and load-balancing adjustments to restore functionality. Such incidents disrupt developer workflows that rely on AI for tasks like code completion and documentation lookup, forcing a reversion to manual processes.
- For an engineering manager, a key strategy during a third-party outage is to establish a clear communication plan, such as having designated backup communication channels, to prevent information silos when primary tools like Teams are unavailable. This ensures that frontend and backend teams can maintain alignment on priorities and blockers.
- When collaboration tools are down, frontend development teams can mitigate productivity loss by using version control systems like Git for asynchronous work and leveraging tools such as Visual Studio Live Share for real-time pair programming and debugging.
- GitHub's approach to high availability involves an "Engineering Fundamentals" program that focuses on durable ownership of software assets, incident readiness to guide on-call engineers, and iterative simplification to improve performance and prevent degraded experiences.
- The increasing dependency on a single cloud provider is a recognized risk, with industry analysts noting that concentrating an entire digital infrastructure with one vendor creates a significant single point of failure. A multi-cloud strategy is often recommended to mitigate these risks and improve resilience.
- Microsoft's financially backed Service Level Agreement (SLA) for Office 365 promises 99.9% uptime for its business customers. Historically, for the four quarters between July 2012 and June 2013, the reported worldwide uptime averages for Office 365 were 99.98%, 99.97%, 99.94%, and 99.97% respectively.
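
To put the SLA figures in perspective, the uptime percentages can be converted into a downtime allowance. The following is a minimal sketch, not Microsoft's own SLA calculation: it assumes a 30-day measurement window and treats uptime as a simple fraction of wall-clock minutes. The function name `allowed_downtime_minutes` and the window length are illustrative assumptions.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` at the given uptime percentage.

    Assumption: uptime is measured as a plain fraction of wall-clock time.
    """
    return (1 - uptime_percent / 100) * days * MINUTES_PER_DAY


# Quarterly worldwide uptime figures quoted above (July 2012 to June 2013).
quarterly_uptime = [99.98, 99.97, 99.94, 99.97]

# Downtime budget implied by the 99.9% SLA over an assumed 30-day window.
sla_budget = allowed_downtime_minutes(99.9)

# Simple arithmetic mean of the four reported quarters.
average_uptime = sum(quarterly_uptime) / len(quarterly_uptime)

print(f"99.9% SLA: about {sla_budget:.1f} minutes of downtime per 30 days")
print(f"Mean of the four quarters: {average_uptime:.3f}%")
for pct in quarterly_uptime:
    print(f"{pct}% uptime: about {allowed_downtime_minutes(pct):.1f} minutes per 30 days")
```

Under these assumptions, a 99.9% commitment leaves roughly 43 minutes of downtime per 30 days, and each of the reported quarterly averages sits comfortably inside that budget.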