Use circuit breakers for LLMs
- Recent industry videos promoted circuit breakers, fallbacks, and recovery paths as core resilience patterns for LLM-based production applications.
- Recommended practices include treating LLM inference as an unreliable external dependency, routing to smaller models, caching responses, and defining health states such as healthy, slow, low-confidence, and unavailable.
- The videos framed these patterns as table stakes for production LLMs and as part of production architecture blueprints. (youtube.com) (youtube.com) (youtube.com)
1/ Large language models power everything from chatbots to code assistants, but they're notoriously unreliable in production. Recent videos from industry experts promote circuit breakers, fallbacks, and recovery paths as essential resilience patterns for LLM apps.
2/ Why treat LLMs like flaky external services? LLM inference is stochastic: outputs vary, latency spikes under load, and failures such as timeouts or hallucinated output can cascade downstream. The videos frame inference as an unreliable external dependency that needs the same guards as any other remote call in a distributed system.
3/ Circuit breakers are the first line of defense: they monitor LLM call success rates, latencies, and confidence scores. When a threshold is crossed (e.g., a 20% error rate over the last 10 calls), the breaker "trips" (opens) and stops sending requests to the failing model so failures don't cascade. Recovery: after a cooldown, a half-open state lets a few test requests through and closes the breaker again if they succeed.
4/ Fallbacks kick in once the breaker trips: route to a smaller, faster model such as Llama 3 8B instead of GPT-4o, or serve a cached response for an identical prompt. No cache hit? Drop to rules-based logic or simple keyword responses. This keeps the app responsive even when the big model flakes.
5/ Define explicit health states for observability: Healthy (low latency, high confidence), Slow (P95 latency > 5 s; queue requests), Low-Confidence (scores < 0.8; flag for review), Unavailable (full fallback). These states drive alerts and dashboards.
6/ Production blueprint example: incoming query → cache lookup → on a miss, route to the model pool (primary/backup) → validate the output and check confidence → cache the result → respond or degrade gracefully. Add retries with exponential backoff (plus jitter to avoid thundering herds), but cap them at 3 attempts.
7/ Caching is underrated: store prompt-response pairs with TTLs based on topic volatility (e.g., 1h for news, 24h for code snippets). Use a vector DB to match semantically similar prompts, not just identical ones. This can cut costs by 50-80% on repeat traffic, depending on hit rate.
8/ Typical trigger: during peak hours, GPT-4 latency jumps to 10s+. The circuit breaker trips once 15% of calls fail. The app switches to a fine-tuned Mistral 7B (~200 ms responses), and most users never notice. The postmortem tunes the thresholds.
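The breaker described in post 3 can be sketched in a few lines of Python. This is a minimal, single-threaded illustration and not any library's API: `CircuitBreaker`, `call_with_breaker`, the 30-second cooldown, and the rule that the window must be full before the breaker can trip are all assumptions made for the example. Only the 20% over 10 calls threshold comes from the thread.

```python
import time
from collections import deque


class CircuitBreaker:
    """Closed -> open -> half-open breaker over a sliding window of recent calls.

    Hypothetical sketch: names and defaults are illustrative assumptions.
    """

    def __init__(self, window=10, max_error_rate=0.2, cooldown_s=30.0,
                 clock=time.monotonic):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.max_error_rate = max_error_rate
        self.cooldown_s = cooldown_s          # assumed cooldown before probing
        self.clock = clock                    # injectable so tests can fake time
        self.state = "closed"
        self.opened_at = 0.0

    def allow(self):
        """Return True if a call to the primary model may proceed."""
        if self.state == "open" and self.clock() - self.opened_at >= self.cooldown_s:
            self.state = "half_open"  # cooldown elapsed: let a probe call through
        return self.state != "open"

    def record(self, ok):
        """Report the outcome of a call that allow() let through."""
        if self.state == "half_open":
            if ok:
                self.state = "closed"   # probe succeeded: resume normal traffic
                self.results.clear()
            else:
                self._trip()            # probe failed: back to open
            return
        self.results.append(ok)
        window_full = len(self.results) == self.results.maxlen
        error_rate = self.results.count(False) / len(self.results)
        if window_full and error_rate >= self.max_error_rate:
            self._trip()

    def _trip(self):
        self.state = "open"
        self.opened_at = self.clock()
        self.results.clear()


def call_with_breaker(breaker, primary, fallback, prompt):
    """Call primary unless the breaker is open; use fallback on any failure."""
    if not breaker.allow():
        return fallback(prompt)
    try:
        result = primary(prompt)
    except Exception:
        breaker.record(False)
        return fallback(prompt)
    breaker.record(True)
    return result
```

Here `primary` and `fallback` are any callables that take a prompt and return text, for example a wrapper around the large model and one around the smaller model. A production version would also need locking for concurrent callers and would likely count slow or low-confidence responses as failures, which this sketch leaves out.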
9/ Tools to implement: Resilience4j for breakers in Java (Hystrix still works but has been in maintenance mode since 2018); Polly in .NET; or a custom breaker with Prometheus for metrics. LangChain/PromptFlow add LLM-specific guards like output parsers. Open-source tools like Helicone track these metrics in production.
10/ These aren't nice-to-haves: the videos call them "table stakes" for prod LLMs. Skip them, and one bad inference batch can destroy users' trust in your app. Build them in from day one. Next: agentic systems will demand even tighter resilience.
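For the "custom" route mentioned in post 9, the health states from post 5 and the fallback chain from post 4 can be hand-rolled with the standard library. The sketch below is a hedged illustration: `Health`, `classify`, `TTLCache`, and `answer` are hypothetical names, the order in which the states are checked is an assumption, and the cache matches exact prompts only (no semantic matching). The 5 s and 0.8 thresholds come from the thread.

```python
import time
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    SLOW = "slow"
    LOW_CONFIDENCE = "low_confidence"
    UNAVAILABLE = "unavailable"


def classify(p95_latency_s, confidence, reachable=True):
    """Map raw metrics to a health state (check order is an assumption)."""
    if not reachable:
        return Health.UNAVAILABLE
    if p95_latency_s > 5.0:
        return Health.SLOW
    if confidence < 0.8:
        return Health.LOW_CONFIDENCE
    return Health.HEALTHY


class TTLCache:
    """Exact-match prompt cache with per-entry expiry."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock  # injectable so tests can fake time

    def get(self, prompt):
        entry = self._store.get(prompt)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[prompt]  # entry expired: drop it
            return None
        return value

    def put(self, prompt, value, ttl_s):
        self._store[prompt] = (value, self._clock() + ttl_s)


def answer(prompt, models, cache, ttl_s=3600.0,
           canned="Sorry, please try again shortly."):
    """Try the cache, then each (name, callable) model in order, then a canned reply.

    Returns a (text, source) pair so callers can log which path served the request.
    """
    cached = cache.get(prompt)
    if cached is not None:
        return cached, "cache"
    for name, call in models:
        try:
            text = call(prompt)
        except Exception:
            continue  # this model failed or timed out: try the next one
        cache.put(prompt, text, ttl_s)
        return text, name
    return canned, "canned"
```

Passing `models` as an ordered list such as `[("primary", call_big_model), ("small", call_small_model)]` (both callables are placeholders) keeps the degradation order explicit, and the returned source label is the kind of value one might export as a metric label for dashboards.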