Insurtech CTO on Production Agentic AI
In a recent podcast, the CTO of a major insurtech described their production claims automation pipeline as a "swarm of LLM agents for intake, validation, and escalation." They emphasized that the key enabler is not just chaining agents but using a shared event bus and a supervisor process that can re-assign tasks if an agent stalls or fails, ensuring resilience.
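The reassignment behavior described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the insurer's actual design: the agent classes, the `supervise` function, the event tuples, and the timeout value are all hypothetical, and a single in-process `asyncio.Queue` stands in for the shared event bus, which in production would more likely be an external message broker.

```python
import asyncio


class StalledAgent:
    """Hypothetical agent that never returns (simulates a hung LLM call)."""
    name = "intake-a"

    async def handle(self, task: str) -> str:
        await asyncio.sleep(3600)
        return "unreachable"


class FailingAgent:
    """Hypothetical agent that raises (simulates an upstream model error)."""
    name = "intake-b"

    async def handle(self, task: str) -> str:
        raise RuntimeError("upstream model error")


class HealthyAgent:
    """Hypothetical agent that completes the task."""
    name = "intake-c"

    async def handle(self, task: str) -> str:
        return f"processed:{task}"


async def supervise(task, agents, bus, timeout=0.05):
    """Offer the task to each agent in turn and publish every outcome to the bus."""
    for agent in agents:
        try:
            # wait_for cancels the agent's coroutine if it exceeds the timeout.
            result = await asyncio.wait_for(agent.handle(task), timeout)
        except asyncio.TimeoutError:
            await bus.put(("stalled", agent.name, task))
            continue  # re-assign to the next agent
        except Exception:
            await bus.put(("failed", agent.name, task))
            continue  # re-assign to the next agent
        await bus.put(("completed", agent.name, task))
        return result
    # No agent succeeded: hand the task to a person instead of dropping it.
    await bus.put(("escalated", "human", task))
    return None


def run_demo():
    async def main():
        bus = asyncio.Queue()  # assumption: stand-in for the shared event bus
        agents = [StalledAgent(), FailingAgent(), HealthyAgent()]
        result = await supervise("claim-001", agents, bus)
        events = []
        while not bus.empty():
            events.append(bus.get_nowait())
        return result, events

    return asyncio.run(main())
```

Calling `run_demo()` returns the final result together with the ordered event log. Because every outcome is published as an event rather than handled through direct agent-to-agent calls, other consumers (monitoring, audit, escalation) could subscribe without the agents knowing about them.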
- The "swarm" architecture described is a form of decentralized, multi-agent system where specialized agents collaborate and emergent behavior solves complex problems, contrasting with hierarchical "supervisor" patterns where a central agent decomposes tasks and orchestrates a workflow. The supervisor model, common in frameworks like LangGraph, provides more explicit control and auditability by managing state and delegating to "worker" agents for specific functions like data validation or API calls.
- A shared event bus is central to resilient agentic systems because it decouples agents, allowing them to operate asynchronously. This event-driven architecture (EDA) is critical for handling the high latency of LLM calls and allows for patterns like retries, graceful degradation, and independent scaling of agent services without creating brittle, direct dependencies.
- Open-source frameworks provide distinct patterns for building these systems: Microsoft's AutoGen excels at conversation-driven, flexible multi-agent collaboration, while CrewAI enforces a role-based structure (e.g., 'researcher,' 'writer') for more team-like orchestration. LangChain's LangGraph is designed for creating stateful, cyclical graphs, which is essential for workflows that require persistence, retries, and human-in-the-loop interventions.
- Beyond claims intake, these multi-agent patterns are being applied to automate complex underwriting assessments and dynamic fraud detection. A key challenge in production insurance systems is "temporal data drift," where evolving policy language and new fraud patterns can degrade model accuracy from over 85% to under 40% in months if the system isn't designed for continuous, component-level fine-tuning.
- Productionizing agentic systems introduces significant operational challenges beyond prototyping, including high computational costs from numerous LLM calls, unpredictable end-to-end latency, and the difficulty of debugging and ensuring response quality across multiple interacting agents.
- A supervisor agent's core responsibilities go beyond delegating tasks: it also performs quality control on each worker agent's output before triggering the next step, and it synthesizes the final, coherent result from multiple specialized outputs. This hierarchical approach improves reliability and makes complex failures easier to inspect than in flatter, more decentralized agent structures.
- In insurance claims, agentic workflows can automate the entire lifecycle by classifying claim severity, cross-referencing policy data against unstructured documents, and routing complex cases to human adjusters, with some implementations reportedly reaching accuracy of up to 99.99%. One Swiss insurer achieved a 40% automation rate in processing paper-based claims by implementing a system to triage, route, and extract information.
- The choice between orchestration styles has direct implications for development velocity and control: CrewAI's structured flows can help teams reach consistent results faster, AutoGen provides high flexibility for experimentation, and LangGraph offers explicit control for durable, production-grade workflows.
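The supervisor responsibilities listed above (delegate, quality-check each output, synthesize, and route hard cases to a human adjuster) can be illustrated with a framework-free sketch. Everything here is an assumption made for illustration: the `Step` type, the two worker functions, the claim fields, the 10,000 severity threshold, and the routing rule are hypothetical, and plain functions stand in for LLM-backed worker agents. It does not use or represent the LangGraph, CrewAI, or AutoGen APIs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    worker: Callable[[dict], dict]  # hypothetical worker agent
    check: Callable[[dict], bool]   # quality gate applied to the worker's output


def classify_severity(claim: dict) -> dict:
    # Assumption: a fixed amount threshold stands in for a model's judgment.
    return {"severity": "high" if claim["amount"] > 10_000 else "low"}


def match_policy(claim: dict) -> dict:
    # Assumption: coverage is a simple membership test on listed perils.
    return {"covered": claim["peril"] in claim["policy_perils"]}


STEPS = [
    Step("classify_severity", classify_severity,
         lambda out: out.get("severity") in {"low", "high"}),
    Step("match_policy", match_policy,
         lambda out: isinstance(out.get("covered"), bool)),
]


def run_supervisor(claim: dict, steps: list, max_attempts: int = 2) -> dict:
    """Run each step, gate its output, and synthesize a routing decision."""
    state = dict(claim)
    audit = []  # (step name, attempt number, passed quality check)
    for step in steps:
        for attempt in range(1, max_attempts + 1):
            output = step.worker(state)
            ok = step.check(output)
            audit.append((step.name, attempt, ok))
            if ok:
                state.update(output)  # only accepted outputs enter shared state
                break
        else:
            # Every attempt failed the quality gate: escalate, do not continue.
            return {"route": "human_adjuster",
                    "reason": f"{step.name} failed quality check",
                    "audit": audit}
    # Synthesis: combine the specialized outputs into one decision.
    needs_human = state["severity"] == "high" or not state["covered"]
    return {"route": "human_adjuster" if needs_human else "auto_settle",
            "severity": state["severity"],
            "covered": state["covered"],
            "audit": audit}
```

A call such as `run_supervisor({"amount": 800, "peril": "water", "policy_perils": ["water", "fire"]}, STEPS)` walks both steps and returns a decision with its audit trail. Keeping the per-step audit list is what makes failures inspectable: each entry records which worker ran, on which attempt, and whether its output passed the gate.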