Cut MTTD/MTTR 80% with AI

- Microsoft, Amazon Web Services, Datadog and incident.io are pushing AI site reliability agents that investigate alerts, correlate telemetry and draft fixes.
- Vendors and users say the biggest gains come before repair: Struct says triage falls from 45 minutes to 5 minutes, cutting MTTR 80%.
- The field is moving toward approval gates, not full autonomy, as vendors add governed remediation and human signoff. (learn.microsoft.com)

Site reliability engineering is the practice of keeping software up when it breaks. The new twist is AI agents that do the first round of detective work across logs, metrics, traces and recent code changes. (learn.microsoft.com) (aws.amazon.com)

Microsoft says its Azure SRE Agent continuously watches resources, investigates incidents and can automate remediation tasks to lower mean time to resolution (MTTR). Datadog’s Bits AI now offers alert investigations that pull in traces, logs, metrics and recent changes inside the incident workflow. (learn.microsoft.com) (docs.datadoghq.com)

Amazon Web Services published a reference design for a multi-agent SRE assistant built with four specialist agents under a supervisor. Its example setup splits work across Kubernetes, logs, metrics and runbooks so one system can answer questions like why payment-service pods are crash looping. (aws.amazon.com)

The appeal is simple: most outage time is not spent typing the fix. It is spent gathering context, checking six dashboards, comparing deploy history and deciding which clue matters. (engineering.razorpay.com) (incident.io)

Razorpay said one payment incident took 32 minutes of manual investigation across six systems before engineers traced the problem to a bad deployment. The company built a multi-agent Oncall Agent after finding that investigation alone was consuming 20 to 40 minutes per incident across 15 to 20 weekly incidents. (engineering.razorpay.com)

Struct, an on-call triage startup, says teams using AI agents cut investigation time from 45 minutes to 5 minutes and reduce MTTR by 80%. incident.io says customers are seeing MTTR reductions of up to 80% by removing the coordination work that fills the first 10 to 15 minutes of an incident. (blog.struct.ai) (incident.io)

That 80% figure is not a single industry benchmark. It is a vendor-and-user claim that shows up across product blogs, customer stories and early deployments, with the biggest savings coming from triage, coordination and root-cause analysis rather than from fully automatic repair. (blog.struct.ai) (incident.io) (infoq.com)

The architecture is also settling into a pattern. One agent gathers logs, another checks metrics, another reviews runbooks or code changes, and a supervisor decides what to ask next and how to summarize the evidence. (aws.amazon.com) (opsworker.ai)

The part vendors are most careful about is action. Microsoft says Azure SRE Agent requires approval before taking actions on a user’s behalf, and incident.io recommends human approval before execution in early deployments to build trust. (learn.microsoft.com) (incident.io)

Google Cloud’s own SRE write-up frames AI the same way: assist operators during an outage without taking away control. InfoQ’s coverage of OpsWorker reached a similar conclusion, arguing that the value is orchestration around the human on call, not handing the pager to a machine. (cloud.google.com) (infoq.com)

The near-term change is not self-healing software that runs unsupervised. It is faster, evidence-backed investigation that turns a 30-minute hunt into a five-minute briefing before a human decides what to do next. (blog.struct.ai) (learn.microsoft.com)
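
For readers who want to see the shape of the pattern described above, here is a minimal Python sketch of specialist agents under a supervisor, with a human approval gate in front of any action. It is an illustration only, not any vendor's implementation: the names Finding, Supervisor, logs_agent, metrics_agent and runbook_agent are hypothetical, and the specialists return canned placeholder text instead of querying real telemetry.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Finding:
    source: str   # which specialist produced this evidence
    summary: str  # one-line, human-readable observation


# Hypothetical specialists. In a real system each would query a backend;
# here they return canned placeholder text so the sketch runs on its own.
def logs_agent(service: str) -> Finding:
    return Finding("logs", f"{service}: error rate rose after the latest deploy")


def metrics_agent(service: str) -> Finding:
    return Finding("metrics", f"{service}: pod restarts spiked")


def runbook_agent(service: str) -> Finding:
    return Finding("runbooks", f"{service}: runbook suggests rolling back the deploy")


@dataclass
class Supervisor:
    # Maps a specialist name to the function that runs it.
    specialists: Dict[str, Callable[[str], Finding]]

    def investigate(self, service: str) -> List[Finding]:
        # Fan out to every specialist and collect the evidence in order.
        return [agent(service) for agent in self.specialists.values()]

    def brief(self, findings: List[Finding]) -> str:
        # Summarize the evidence as a short briefing for the human on call.
        return "\n".join(f"[{f.source}] {f.summary}" for f in findings)

    def remediate(self, action: str, approved_by: Optional[str]) -> str:
        # Approval gate: nothing is executed without a named human approver.
        if not approved_by:
            return f"PENDING APPROVAL: {action}"
        return f"EXECUTED: {action} (approved by {approved_by})"


if __name__ == "__main__":
    supervisor = Supervisor(
        {"logs": logs_agent, "metrics": metrics_agent, "runbooks": runbook_agent}
    )
    findings = supervisor.investigate("payment-service")
    print(supervisor.brief(findings))
    # No approver yet, so the action is only proposed, not run.
    print(supervisor.remediate("roll back payment-service", approved_by=None))
```

The design choice worth noticing is that investigation and action are separate methods: the agents can gather and summarize evidence freely, while remediate refuses to run anything until a person signs off.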

Shared from Scout - Be the smartest in the room.