SRE agents keep failing
SRE practitioners report that AI agents fall apart on novel incidents because of context drift; one operator who runs 16+ agents says live topology and knowledge graphs are essential for reliable RCA. (x.com) One practical response is cohort‑based upskilling: Visualpath announced an SRE batch starting March 23, 2026 to tackle these gaps. (x.com)
Engineering posts since late 2025 have coined the term “agent drift” for failures in which autonomous SRE agents lose situational context over long investigations and stop converging on the correct root cause. (prassanna.io/blog/agent-drift/) Microsoft’s Azure SRE Agent product now advertises “Deep Context” (continuous access to runtime, repos and persistent memory) as a countermeasure, so that agents retain environment state across incidents rather than relying on one-off prompts. (techcommunity.microsoft.com/blog/appsonazureblog/azure-sre-agent-now-builds-expertise-like-your-best-engineer-introducing-deep-co/4500754) Field experiments and how‑to writeups show that topology‑aware agents and incident knowledge graphs can cut RCA time from tens of minutes to under a minute by correlating live topology with telemetry and past incident patterns. (dev.to/roops/topology-aware-ai-agents-for-observability-automating-slo-breach-root-cause-analysis-60i) Sherlocks.ai, led by Gaurav Toshniwal, markets autonomous SRE agents that continuously ingest historical incidents and chat logs to supply context-aware diagnoses, and frames the tooling gap as a lack of explainable system knowledge rather than a model-capability shortfall. (techbullion.com/the-hidden-cost-of-system-outages/) Visualpath’s public schedule lists a 40‑day Site Reliability Engineering cohort led by “Mr. Koti reddy” starting on 16 March 2026, which signals one vendor’s operational response: cohort-based upskilling rather than product-only fixes. (visualpath.in/upcoming-batches.html)
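To make "correlating live topology with telemetry" concrete, here is a minimal sketch of one common heuristic, not the method used by any product or writeup cited above. It assumes a hypothetical service dependency map (`TOPOLOGY`) and a set of services currently flagged as anomalous by telemetry. Starting from the service whose SLO was breached, it walks downstream dependencies and reports anomalous services whose own dependencies all look healthy, since those are the deepest points where the fault could originate. All service names and the `root_cause_candidates` helper are illustrative assumptions.

```python
from collections import deque

# Hypothetical topology: each service maps to the services it depends on.
TOPOLOGY = {
    "frontend": ["checkout", "catalog"],
    "checkout": ["payments", "inventory"],
    "catalog": ["inventory"],
    "payments": ["db"],
    "inventory": ["db"],
    "db": [],
}


def root_cause_candidates(topology, anomalous, breached):
    """Breadth-first walk from the breached service through its dependencies.

    A service is a candidate when it is anomalous and none of its direct
    dependencies are anomalous (illustrative heuristic, assumed here).
    """
    seen = {breached}
    queue = deque([breached])
    candidates = []
    while queue:
        svc = queue.popleft()
        deps = topology.get(svc, [])
        # Anomalous with healthy dependencies: the fault likely starts here.
        if svc in anomalous and not any(d in anomalous for d in deps):
            candidates.append(svc)
        for dep in deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return candidates


if __name__ == "__main__":
    # Telemetry (assumed) flags three services; only one has healthy dependencies.
    flagged = {"frontend", "checkout", "payments"}
    print(root_cause_candidates(TOPOLOGY, flagged, "frontend"))
```

With the assumed inputs, the walk skips `frontend` and `checkout` because each has an anomalous dependency, and returns `payments`. A real agent would presumably read topology from a live source and weigh past incident patterns as well, which this sketch leaves out.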