RAG checklists go production
Practitioners are converging on production RAG patterns—hybrid sparse+dense retrieval, BM25 plus neural rerank, context deduplication, token budgeting and answer caching—framing RAG as a data‑ops and governance problem rather than only a modeling one. (x.com) (x.com)
Retrieval-augmented generation is settling into an operations playbook: teams are pairing keyword search, vector search, reranking, chunk control, and caching instead of treating it as a model-only trick. (learn.microsoft.com) (elastic.co)
Retrieval-augmented generation, or RAG, works by fetching outside documents before a model answers, so the model can ground a reply in current or proprietary data. Microsoft calls it an “industry-standard approach” for applications that need information the model does not already know. (learn.microsoft.com 1) (learn.microsoft.com 2)
The retrieval step is where the production checklist is getting more specific. Elasticsearch says hybrid search blends lexical ranking such as BM25 with semantic retrieval, while Azure says its semantic ranker reranks an initial BM25- or Reciprocal Rank Fusion-ranked result set. (elastic.co) (learn.microsoft.com)
That two-stage pattern has become common: a first pass casts a wide net, then a reranker reorders the shortlist. Cohere says reranking can sit on top of lexical or semantic search, and Pinecone describes rerankers as one of the fastest fixes when an out-of-the-box RAG system underperforms. (docs.cohere.com) (pinecone.io)
The next bottleneck is not finding text but deciding how much of it to send. Azure’s RAG guidance says oversized chunks are expensive, can overwhelm token limits, and often reduce answer quality by stuffing the model with irrelevant context. (learn.microsoft.com)
That is why teams now talk about token budgets the way search engineers talk about latency budgets. OpenAI’s token counting API is designed to measure the exact payload a request will send, including tools, schemas, images, and model-specific behavior that local estimates can miss. (developers.openai.com)
Caching has moved into the same production checklist. OpenAI says prompt caching is automatic on prompts of 1,024 tokens or longer, can cut latency by up to 80 percent and input costs by up to 90 percent, and works best when static instructions stay at the front of the prompt. (developers.openai.com)
The center of gravity is shifting from model selection to data handling. Azure’s current RAG overview lists query understanding, multi-source access, token constraints, response-time expectations, and security and governance as core implementation problems. (learn.microsoft.com)
That framing changes who owns the work. A production RAG system now looks less like a single prompt wrapped around a vector database and more like a search stack with document preparation, ranking logic, access controls, evaluation, and cost controls tuned together. (learn.microsoft.com) (pinecone.io)
The result is a plainer definition of progress: fewer duplicate chunks, fewer wasted tokens, faster repeated answers, and retrieval that can explain why one passage beat another. The checklist is getting longer, but the pattern is getting easier to recognize. (learn.microsoft.com) (docs.cohere.com)
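For readers who want to see the checklist in miniature, here is a rough Python sketch of three of the steps described above: fusing a keyword ranking and a vector ranking with Reciprocal Rank Fusion, dropping duplicate chunks, and packing the survivors into a token budget. It is an illustration under stated assumptions, not any vendor's implementation. The function names, the sample chunks, the smoothing constant k=60 (a commonly used default) and the whitespace-based token estimate are all hypothetical; a production system would count tokens with the model's own tokenizer or a token counting API.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranking.

    Each list contributes 1 / (k + rank) per id. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def estimate_tokens(text):
    # Crude stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())


def pack_context(ranked_ids, chunks, budget):
    """Walk the fused ranking, skip duplicates, and stay within the budget."""
    seen = set()
    picked = []
    used = 0
    for chunk_id in ranked_ids:
        text = chunks[chunk_id]
        # Normalize case and whitespace so near-identical copies collide.
        key = " ".join(text.lower().split())
        if key in seen:
            continue
        cost = estimate_tokens(text)
        if used + cost > budget:
            continue  # too large for what is left; a smaller chunk may still fit
        seen.add(key)
        picked.append(chunk_id)
        used += cost
    return picked, used


# Hypothetical sample data: two retrievers returning overlapping results.
chunks = {
    "a": "BM25 ranks by term overlap",
    "b": "bm25 ranks  by term overlap",  # duplicate of "a" after normalization
    "c": "Dense vectors capture meaning beyond exact words",
    "d": "Rerankers reorder a shortlist",
}
bm25_hits = ["a", "b", "c"]
vector_hits = ["c", "a", "d"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
picked, used = pack_context(fused, chunks, budget=10)
print(fused)         # ['a', 'c', 'b', 'd']
print(picked, used)  # ['a', 'd'] 9
```

In this toy run, chunks that appear in both rankings rise to the top, the duplicate is dropped, and the seven-word chunk is skipped because it would push the context past the ten-token budget. A neural reranker, where one is used, would typically sit between the fusion step and the packing step.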