Shopify Engineer: Treat AI Agents Like Microservices
Anna Li, a Staff ML Engineer at Shopify, argued that reliability in multi-agent LLM workflows depends on explicit agent contracts and robust fallback paths. Speaking on a podcast, she stated, “We treat each agent in the system as a microservice with strict input/output schemas, even if the ‘code’ is prompt engineering.” The approach includes versioning prompts the same way API changes are versioned, to minimize “prompt drift.”
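A minimal sketch of what an agent contract with a fallback path could look like, assuming a claim-classification agent. The names `ClaimInput`, `ClaimOutput`, `validate_output`, and `run_with_fallback` are hypothetical and chosen for illustration; they do not describe Shopify's implementation.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical contract types: every name here is illustrative.
@dataclass(frozen=True)
class ClaimInput:
    claim_id: str
    description: str

@dataclass(frozen=True)
class ClaimOutput:
    claim_id: str
    category: str
    source: str  # "agent" when the contract held, "fallback" otherwise

ALLOWED_CATEGORIES = {"motor", "property", "other"}

def validate_output(raw: object, claim_id: str) -> ClaimOutput:
    """Enforce the output schema; raise ValueError on any violation."""
    if not isinstance(raw, dict) or set(raw) != {"category"}:
        raise ValueError("output must be a dict with exactly one key: 'category'")
    category = raw["category"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise ValueError(f"category not allowed: {category!r}")
    return ClaimOutput(claim_id, category, "agent")

def run_with_fallback(
    agent: Callable[[ClaimInput], object], request: ClaimInput
) -> ClaimOutput:
    """Call the agent; on an error or contract violation, return a safe default."""
    try:
        return validate_output(agent(request), request.claim_id)
    except Exception:
        # Fallback path: a conservative answer that downstream agents can rely on.
        return ClaimOutput(request.claim_id, "other", "fallback")
```

In this sketch the caller never sees a malformed response: the agent either satisfies the schema or is replaced by the fallback value, so a failure stays inside the agent boundary in the same way a failing microservice would be isolated.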
- The microservice approach to AI agents decomposes systems into independent cognitive functions like planning, knowledge retrieval, and task execution, each with its own API. This pattern improves resilience, as a failure in one agent is isolated, and allows for independent scaling of components, a significant advantage over monolithic AI designs.
- Frameworks for orchestrating these agents include AutoGen, which excels at multi-agent collaboration, LangChain for API-driven and Retrieval Augmented Generation (RAG) workflows, and CrewAI, which is suited for structured, role-based processes. The core architectural choice is often described as: LangChain provides the "LEGO bricks," AutoGen facilitates the "conversation" between agents, and CrewAI offers a "crew with a mission briefing."
- A key challenge, "prompt drift," occurs when an LLM's interpretation of a stable prompt changes over time due to model updates, leading to performance degradation. Mitigation involves treating prompts like versioned APIs, implementing LLM observability to detect behavioral shifts, and using CI tools like PromptDrifter to automate regression testing of prompt-response pairs in the build pipeline.
- In insurance, this architecture is used to automate claims processing, where multi-agent systems handle First Notice of Loss (FNOL), document classification, and fraud detection. Case studies show AI agents achieving up to a 91% automation rate for eligible motor claims and 99% straight-through processing, reducing claim lifecycle costs and cutting processing time by 46%.
- For underwriters, LLM-powered agents extract and standardize risk data from unstructured documents, such as risk engineering reports, to support more informed decisions. This allows underwriters to focus on complex risks rather than manual data entry, improving efficiency and enabling a more consistent evaluation against an insurer's specific risk models.
- The Principal Engineer path involves leveraging this deep technical expertise to set technical direction and influence without direct authority. A key responsibility is to align autonomous agent architecture with broader business goals, acting as the bridge between technical teams and stakeholders by translating business objectives into scalable, resilient technical strategies.
- After a multi-year contraction, insurtech venture funding is rebounding, with global investment rising 19.5% in 2025 to $5.08 billion, the first annual increase since 2021. AI-focused insurtechs are a primary driver, capturing two-thirds (US$3.35 billion) of all funding in 2025, signaling strong investor confidence in agentic infrastructure for core processes like underwriting and claims.
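
The prompt-drift mitigation listed above (versioned prompts plus regression testing of prompt-response pairs) can be sketched roughly as follows. This is an illustrative, standard-library-only example: `PromptVersion`, `find_drift`, and the `model` callable are hypothetical names, and the sketch does not reproduce PromptDrifter's actual interface.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical versioned prompt: name and version are managed like an API release.
@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: str
    template: str  # must contain an "{input}" placeholder

    @property
    def fingerprint(self) -> str:
        """Short hash of the template, so an unversioned edit is detectable."""
        return hashlib.sha256(self.template.encode("utf-8")).hexdigest()[:12]

def find_drift(
    prompt: PromptVersion,
    model: Callable[[str], str],
    baseline: Dict[str, str],
) -> List[str]:
    """Return the inputs whose current response differs from the recorded one."""
    drifted = []
    for user_input, expected in baseline.items():
        actual = model(prompt.template.format(input=user_input))
        # Assumption: exact match after trimming whitespace. A real check
        # would more likely use a tolerance or a semantic comparison.
        if actual.strip() != expected.strip():
            drifted.append(user_input)
    return drifted
```

A build pipeline could run `find_drift` against a stored baseline after every model or prompt update and fail the build when the returned list is non-empty.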