RAG needs 50–100 ground truths
- Retrieval-augmented generation systems need a small, labeled test set before scaling, with practitioners recommending roughly 50 to 100 ground-truth examples.
- Those examples should mirror real user questions, include expected answers and retrieved context, and be scored for correctness, faithfulness, precision, and recall.
- The push reflects a shift from demo-driven RAG to traced, testable software with observability and regression checks. (docs.ragas.io)

Retrieval-augmented generation, or RAG, is a way to make a model answer from your documents instead of its memory. It works in two steps: fetch the right passages, then write an answer from them. (docs.ragas.io)

That means a RAG system can fail in two different places. The retriever can fetch the wrong chunks, or the model can misread good chunks and still produce a wrong answer. (braintrust.dev) (docs.ragas.io)

The practical fix is to treat RAG like software, not like a one-off demo. Teams build a small labeled set of questions with expected answers and use it as a repeatable test suite. (docs.ragas.io 1) (docs.ragas.io 2)

In the operational advice circulating among practitioners, the number is modest: about 50 to 100 ground-truth cases. The point is not statistical perfection; it is to cover the failure modes that keep showing up in production. (docs.ragas.io) (braintrust.dev)

Ragas, an open-source evaluation framework, expects test rows with fields like the user question, the contexts passed into the model, and the ground-truth answer. Its documentation says an ideal test set should closely mirror the real-world use case. (docs.ragas.io)

Those labels let teams score more than one thing. Ragas documents metrics for answer correctness against ground truth, context precision for ranking relevant chunks high, and semantic similarity between the answer and the reference. (docs.ragas.io 1) (docs.ragas.io 2) (docs.ragas.io 3)

Braintrust makes the same split explicit in its RAG evaluation guides. Its documentation says teams need to measure retrieval quality and generation quality separately so they can see whether failures start in search, prompting, or synthesis. (braintrust.dev 1) (braintrust.dev 2)

That is where observability comes in. Braintrust’s tracing guidance says useful tooling shows each step of execution, including document retrieval, context assembly, prompt construction, and generation, so engineers can replay failures instead of guessing. (braintrust.dev)

Ragas also supports online integrations where teams log the contexts used at inference time even when no ground truth exists yet. Its Langfuse integration notes that ground-truth data can be absent in online evaluations, but the retrieved contexts are still captured for analysis. (docs.ragas.io)

The pattern is simple: start with a small suite, log every retrieval path, and rerun the suite after each change to chunking, embeddings, rerankers, or prompts. That turns a RAG system from a polished demo into something teams can debug, compare, and ship repeatedly. (braintrust.dev)
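
To make the pattern concrete, here is a minimal sketch of a ground-truth regression suite in plain Python. It does not call Ragas or Braintrust. The names GroundTruthCase, run_suite, and toy_pipeline, the chunk IDs, the sample questions, and the thresholds are all hypothetical, and the token-overlap F1 is a crude stand-in for an answer-correctness metric, not the one Ragas documents. It scores retrieval and generation separately so a failing case points at a stage.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class GroundTruthCase:
    """One labeled test row (hypothetical schema for this sketch)."""
    question: str
    expected_answer: str
    relevant_chunk_ids: Set[str]


def retrieval_scores(retrieved: List[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision and recall of retrieved chunk IDs against the labeled set."""
    unique = set(retrieved)
    hits = len(unique & relevant)
    precision = hits / len(unique) if unique else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def token_f1(answer: str, reference: str) -> float:
    """Token-overlap F1: a rough proxy for answer correctness (an assumption)."""
    a = Counter(answer.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((a & r).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(a.values())
    rec = overlap / sum(r.values())
    return 2 * p * rec / (p + rec)


# A pipeline takes a question and returns (answer, retrieved chunk IDs).
Pipeline = Callable[[str], Tuple[str, List[str]]]


def run_suite(cases: List[GroundTruthCase], pipeline: Pipeline,
              min_recall: float = 0.8, min_answer_f1: float = 0.9) -> List[Dict]:
    """Run every case and record which stage, if any, fell below its threshold."""
    results = []
    for case in cases:
        answer, retrieved = pipeline(case.question)
        precision, recall = retrieval_scores(retrieved, case.relevant_chunk_ids)
        f1 = token_f1(answer, case.expected_answer)
        failures = []
        if recall < min_recall:
            failures.append("retrieval")
        if f1 < min_answer_f1:
            failures.append("generation")
        results.append({"question": case.question, "precision": precision,
                        "recall": recall, "answer_f1": f1, "failures": failures})
    return results


# Two illustrative cases; a real suite would hold roughly 50 to 100.
cases = [
    GroundTruthCase("What is the refund window?",
                    "Refunds are accepted within 30 days", {"policy-3"}),
    GroundTruthCase("Which plan includes SSO?",
                    "The enterprise plan includes SSO", {"pricing-7", "pricing-8"}),
]


def toy_pipeline(question: str) -> Tuple[str, List[str]]:
    """Canned stand-in for a real retriever plus generator."""
    canned = {
        "What is the refund window?":
            ("Refunds are accepted within 30 days", ["policy-3", "faq-1"]),
        "Which plan includes SSO?":
            ("The starter plan includes SSO", ["blog-2", "pricing-7"]),
    }
    return canned[question]


for row in run_suite(cases, toy_pipeline):
    print(row["question"], row["failures"] or "ok")
```

Rerunning a suite like this after each change to chunking, embeddings, rerankers, or prompts gives a before-and-after comparison per case. Note that a "generation" flag on a case that also failed retrieval may simply be a downstream effect of the missing chunk, which is why the two scores are reported side by side.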