SRE shares production‑AI lessons

Principal SRE Saurabh Hirani summarised practical lessons on production‑ready retrieval‑augmented generation (RAG), agentic development, using AI to reduce mean time to resolution, agent editor communications, and emerging LLM ops approaches. The thread aggregates operational tactics for running models and agents reliably in production environments. (x.com)

Retrieval‑augmented generation is a way to make a model answer from outside documents instead of memory alone, and Saurabh Hirani said production use starts with boring operational guardrails, not prompts. (x.com) Hirani, a principal site reliability engineer at One2N, has spent nearly two decades working on high‑traffic systems across data centers and cloud platforms, according to a January 31, 2026 Luma event page. A September 27, 2025 One2N meetup listing also described his talk as “AI Enabled SRE Practices.” (luma.com) (youtube.com)

In that meetup talk, Hirani framed the day job first: monitoring, incident response and scaling, then the mess around them, including alert fatigue, legacy runbooks, inconsistent root‑cause analysis and ad hoc “ClickOps” changes in dashboards, chat tools and paging systems. He listed alert categorization, runbook suggestions and root‑cause analysis templates as practical places where artificial intelligence can help. (youtube.com)

Site reliability engineering is the discipline that keeps software available during failures, and mean time to recovery tracks how long a team takes to restore service after an incident. Hirani’s 2025 talk explicitly tied artificial intelligence work to shortening that recovery clock rather than replacing operators. (youtube.com)

That emphasis matches a wider shift in reliability tooling toward using models on top of telemetry, runbooks and incident reports instead of treating a chatbot as the product. Last9 said on October 15, 2025 that Gartner had named it a “Cool Vendor in AI for SRE and Observability” for a platform built around unified telemetry and an agent software development kit. (last9.io)

Hirani’s thread also touched on agentic development, which means software agents can plan steps, fetch context and act through tools instead of only generating one answer. A 2025 survey on agentic retrieval‑augmented generation described the pattern as an evolution from standard retrieval systems toward more autonomous control structures and tool use. (arxiv.org)

The operational catch is that agents inherit the same old reliability problems in a new form: stale documentation, inconsistent naming, missing ownership and weak post‑incident records. In a 2026 One2N interview, Hirani and One2N chief technology officer Jaideep said organizational context, knowledge sharing and responsibility ownership matter more than “yet another AI tool.” (youtube.com)

Hirani has been making the same point in non‑artificial‑intelligence writing for years. In a December 27, 2022 Last9 post, he argued that incident reports need concise structure, links alongside screenshots and a clear sequence of events so other teams can act on them. (last9.io) That is why his production‑AI advice reads less like a model tutorial and more like an operations checklist: clean up the data, standardize the handoffs, tighten the runbooks and only then let models and agents into the loop. (x.com)
