System design prep must add AI controls
System design interviews now expect candidates to layer intelligence concerns (model routing, token and cost controls, retrieval architecture, human-in-the-loop review, and evaluation pipelines) on top of classic scale problems like caching and partitioning. The shift reframes the LLM as one service in a distributed system where latency budgets, hallucination mitigation, and observability matter as much as throughput. (youtube.com) (cnn.com)
A system design interview used to stop at databases, caches, and queues. In 2026, a candidate can still nail sharding and lose the round if they cannot explain where a large language model sits in the request path, how much it costs per call, and what happens when it makes something up. (cnn.com) (docs.langchain.com)

The basic shift is simple: the model is no longer the whole product. It is one service inside a larger distributed system, the same way a payment processor or search index is one service inside a shopping app. (docs.cloud.google.com) (docs.langchain.com)

A large language model is a prediction engine that turns input text into output text token by token. A token is a small chunk of text, and modern model vendors price requests by counting those chunks on the way in and on the way out. (openai.com) (anthropic.com)

That pricing changes the interview math. Anthropic lists Claude Sonnet 4.6 at $3 per million input tokens and $15 per million output tokens, while Claude Haiku 4.5 starts at $1 and $5, so a design now needs a reason for using the expensive model only when the task actually needs it. (anthropic.com 1) (anthropic.com 2)

That is where model routing comes in. Amazon Bedrock now offers “intelligent prompt routing” through a single serverless endpoint that sends requests to different foundation models based on routing criteria, which is exactly the kind of control interviewers now expect candidates to describe. (docs.aws.amazon.com 1) (docs.aws.amazon.com 2)

The next layer is retrieval, which is just a system fetching outside information before the model answers. Instead of trusting the model’s memory for a company handbook or a legal policy, the app pulls the relevant document first and gives the model a smaller, fresher packet of facts. (docs.langchain.com) (docs.cloud.google.com)

That retrieval step creates a new failure mode. If the wrong document is fetched, the model can produce a polished answer grounded in bad context, so candidates now have to talk about document freshness, ranking quality, and fallback behavior, not just raw query throughput. (docs.langchain.com 1) (docs.langchain.com 2)

Latency has changed too. OpenAI’s latency guide says teams should use fewer tokens, make fewer requests, parallelize work, and avoid defaulting to a model when a cheaper deterministic step will do, which turns response time into an architecture problem instead of a model problem. (openai.com)

Caching now means more than storing database reads. OpenAI says prompt caching can reduce time to first token by up to 80% and input token costs by up to 90%, and Anthropic says its prompt caching can cut latency by up to 85% and cost by up to 90% for long prompts, so interview answers increasingly include repeated prompt prefixes alongside classic cache keys. (openai.com) (anthropic.com)

Human review has become another system component. Microsoft Foundry’s evaluation docs describe setting acceptance thresholds before release, and many production flows now send edge cases like refunds, medical summaries, or policy violations to a person instead of letting the model answer alone. (learn.microsoft.com) (learn.microsoft.com)

The hardest new piece is evaluation. LangSmith’s docs say model outputs are non-deterministic, which means the same prompt can produce different results, so teams build test datasets, score quality and safety, compare versions, and watch traces in production the way older systems watched error rates and tail latency. (docs.langchain.com) (docs.langchain.com)

That is why system design prep now looks different from the old whiteboard routine. The candidate still needs load balancers and partitions, but now they also need token budgets, retrieval quality checks, model selection rules, prompt caching strategy, human escalation paths, and an evaluation loop that catches silent failures before users do. (docs.aws.amazon.com) (openai.com) (docs.langchain.com)
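
A candidate could make the token budget and model selection rules concrete with a few lines of arithmetic. The sketch below is illustrative only: the per-million-token prices are the figures quoted earlier in this article, while the model keys (`sonnet`, `haiku`), the `needs_reasoning` and `high_risk` flags, and the $0.05 per-call budget are hypothetical names and values chosen for the example, not any vendor's API.

```python
from dataclasses import dataclass

# USD per million tokens, taken from the prices quoted in the article.
# The short model keys are hypothetical labels, not real API model IDs.
PRICES = {
    "sonnet": {"input": 3.00, "output": 15.00},
    "haiku": {"input": 1.00, "output": 5.00},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a single call from its token counts."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


@dataclass
class Request:
    input_tokens: int
    expected_output_tokens: int
    needs_reasoning: bool  # assumed to be set by an upstream classifier
    high_risk: bool        # e.g. refunds, medical summaries, policy violations


@dataclass
class Decision:
    model: str
    estimated_cost: float
    escalate_to_human: bool


def route(req: Request, budget_usd: float = 0.05) -> Decision:
    """Pick the cheaper model unless the task needs the expensive one."""
    model = "sonnet" if req.needs_reasoning else "haiku"
    cost = estimate_cost(model, req.input_tokens, req.expected_output_tokens)
    if cost > budget_usd and model != "haiku":
        # Over budget: fall back to the cheaper model rather than fail.
        model = "haiku"
        cost = estimate_cost(model, req.input_tokens, req.expected_output_tokens)
    # High-risk requests are flagged for human review regardless of model.
    return Decision(model, cost, escalate_to_human=req.high_risk)


if __name__ == "__main__":
    decision = route(Request(10_000, 1_000, needs_reasoning=True, high_risk=True))
    print(decision)
```

Under these assumptions, a call with 10,000 input tokens and 1,000 output tokens costs about $0.045 on the larger model and about $0.015 on the smaller one, which is the kind of back-of-the-envelope number an interviewer may ask a candidate to justify.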