Use AI for incident triage

- Microsoft, AWS, IBM and Elastic have published tools and guidance since October 2025 showing AI being used to investigate alerts, correlate telemetry and speed incident triage.
- IBM Instana says its intelligent incident investigation can reduce MTTR by up to 60% to 80%, while Azure and AWS describe AI as an operational teammate.
- Microsoft’s Azure Copilot observability agent remains in preview, and AWS DevOps Agent requires eligible support access through AWS account teams.

Microsoft, AWS, IBM and Elastic have all published recent material showing AI moving into incident response work that used to sit squarely with on-call engineers. The tools are aimed at the first phase of an outage: sorting signal from noise, correlating logs and metrics, tracing dependencies and surfacing likely root causes. The pitch is speed: shorten the time between an alert firing and an engineer forming a usable hypothesis. The vendors’ own descriptions also draw a line between assistance and replacement, with the systems positioned as investigation aids inside broader SRE and DevOps workflows.

### What are engineers actually using AI to do during an incident?

Azure Monitor’s Copilot observability agent, which Microsoft lists as a preview feature, is designed to let engineers “run a deep investigation” from an Azure Monitor alert and ask natural-language questions against observability data. Microsoft says the agent correlates metrics, logs, alerts, tracing data and resource health signals to explain what changed and assess the scope and impact of an issue. (learn.microsoft.com)

AWS said in a March 31, 2026 blog post that a 2 a.m. responder often has to manually correlate telemetry from multiple sources and trace dependencies across services before forming hypotheses. AWS framed its DevOps Agent as an “always-available operations teammate” for AWS, multicloud and on-premises environments, with the stated goal of cutting mean time to resolution from hours to minutes. (aws.amazon.com)

### Where does anomaly detection fit in?

Elastic said in an October 20, 2025 technical post that its observability workflow combines always-on machine learning with a generative AI assistant. The company said unsupervised models profile normal log throughput and content, then flag spikes and new log categories when they deviate from learned baselines.

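The baseline-and-deviation idea Elastic describes can be sketched in a few lines. This is a minimal illustration, not Elastic’s model: it assumes per-minute log counts, a trailing-window z-score for spikes, and a crude digit-masking rule for grouping log lines into categories. The names `find_spikes`, `new_categories` and `categorize` are hypothetical.

```python
import re
from collections import Counter
from statistics import mean, stdev


def find_spikes(counts, window=10, threshold=3.0):
    """Return indices whose count sits more than `threshold` sigmas above the trailing window."""
    spikes = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        # Floor sigma at 1.0 so a perfectly flat baseline does not divide by zero.
        sigma = max(stdev(baseline), 1.0)
        if (counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


def categorize(line):
    """Assumed stand-in for log categorization: mask digits so similar lines share a pattern."""
    return re.sub(r"\d+", "<n>", line)


def new_categories(baseline_logs, current_logs):
    """Return (pattern, count) pairs seen now but never in the baseline, most frequent first."""
    known = {categorize(line) for line in baseline_logs}
    current = Counter(categorize(line) for line in current_logs)
    return [(cat, n) for cat, n in current.most_common() if cat not in known]


if __name__ == "__main__":
    per_minute = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 240, 100]
    print(find_spikes(per_minute))  # [10]

    before = ["GET /cart 200 in 12ms", "GET /cart 200 in 15ms"]
    during = ["GET /cart 200 in 11ms", "db timeout after 5000ms", "db timeout after 5003ms"]
    print(new_categories(before, during))  # [('db timeout after <n>ms', 2)]
```

In this toy run the spike at minute 10 lines up with a log pattern that never appeared in the baseline, which is the kind of pairing an engineer would otherwise have to find by hand.
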
Elastic said the alert view can already connect a spike to a dominant new log pattern before an engineer starts manual hunting. Its AI assistant can then explain the anomaly in plain language, reference the underlying log events and propose next steps, according to the post. (elastic.co)

### How are vendors describing root-cause analysis?

IBM’s Instana documentation says its intelligent incident investigation uses agentic AI, causal AI analysis and large language models to identify a probable root cause across an environment. IBM says the system analyzes topology, distributed tracing, application performance metrics, logs and infrastructure events in parallel.

IBM said the investigation produces a failure propagation chain, a prioritized list of affected components and recommended remediation actions. The company said the feature is designed for DevOps, SRE, platform and operations teams and can reduce MTTR by up to 60% to 80%. (ibm.com)

### Are teams also building their own on-call copilots?

GitHub hosts an open-source sample, On-Call Copilot, built on Microsoft’s agent framework and hosted agents. The repository says four specialist agents (Triage, Summary, Comms and PIR) run concurrently on alerts, logs, metrics and runbook excerpts, then return structured output including root-cause analysis, immediate actions, communications drafts and a post-incident report. (github.com)

AWS also contrasted production-grade operational agents with what it called a “thin wrapper” over a large language model. In its post, AWS said ad hoc use of coding tools can help in straightforward cases, but argued that large-scale environments require topology awareness, governance, access controls and retained learning from past incidents. (aws.amazon.com)

### Does this replace core DevOps and SRE work?

AWS said the information needed to resolve incidents is often scattered across logs, deployment pipelines, configuration histories and third-party monitoring tools. (aws.amazon.com)

Microsoft’s Azure documentation similarly describes the agent as a way to explore telemetry and save investigation context as an issue for later review, rather than as a system that resolves incidents on its own. (learn.microsoft.com)

IBM’s documentation says its system starts an AI investigation from an incident “instead of manually running queries in multiple tools,” but still frames the output as investigative guidance and recommended actions for response teams. (ibm.com) Elastic’s example likewise keeps a human in the loop, with the assistant translating questions into queries and charts to verify end-user impact.

Microsoft says the Azure Copilot observability agent is currently in preview, while AWS says DevOps Agent access depends on an eligible support plan and contact with an AWS account team. The next steps for engineers are on those product pages and repositories: Azure’s investigation workflow in Azure Monitor, AWS’s DevOps Agent setup, IBM Instana’s investigation documentation and the open-source On-Call Copilot sample on GitHub.
