The AI Engineering Iceberg
There's a growing recognition of the "AI engineering iceberg" — the 90% of invisible work required before an LLM is even useful. A recent deep dive highlights the massive effort in data collection, cleaning, chunking, and evaluation that underpins successful RAG systems. This foundational work, from fixing multi-column data to creating synthetic tests for faithfulness, is where most RAG projects fail.
The "AI engineering iceberg" metaphor extends beyond RAG systems; it reflects a broader reality in production AI, where model selection itself is a significant and often counterintuitive challenge. Powerful models excel in demos, but they can be inefficient and costly for specific tasks like data extraction. This has led to a rise in the use of Small Language Models (SLMs) that are optimized for specialized functions and offer lower latency and cost.
Preparing data for an LLM is a multi-stage pipeline. It begins with identifying and collecting data from varied sources such as PDFs, HTML, and JSON files. Extensive cleaning follows: removing duplicates, normalizing text formats (e.g., converting to lowercase), and filtering irrelevant content. Research suggests that up to 80% of the time spent on AI projects is dedicated to these data preparation tasks.
Evaluating a RAG system requires a dual focus on the retrieval and generation components. Key metrics for retrieval include precision (the percentage of retrieved documents that are relevant) and recall (the percentage of all relevant documents that were retrieved). For generation, evaluations measure faithfulness and answer relevance. Frameworks like Ragas report these alongside context recall, which checks whether the retrieved context covers the information needed for the answer.
Even with robust data pipelines, RAG systems face challenges in production, such as "prompt drift," where small changes to prompts can degrade performance, and the difficulty of debugging when errors occur. To mitigate these issues, engineering teams are implementing more rigorous end-to-end evaluation and production observability to monitor quality degradation, latency issues, and user feedback signals in real time.
The financial investment in the "invisible" part of the AI iceberg is substantial. Data collection and preparation alone can represent 15-25% of the total project cost.
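The cleaning steps described above (normalizing text, removing duplicates, filtering irrelevant content) can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the function names, the sample documents, and the `min_words` threshold used as a stand-in for "irrelevant content" are all assumptions made for the example.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace and convert to lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_corpus(docs: list[str], min_words: int = 3) -> list[str]:
    """Normalize, filter, and deduplicate documents, preserving order.

    min_words is a hypothetical relevance filter: very short fragments
    (page markers, stray headers) are treated as irrelevant.
    """
    seen: set[str] = set()
    cleaned: list[str] = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # filter: too short to carry useful content
        if norm in seen:
            continue  # dedupe: identical after normalization
        seen.add(norm)
        cleaned.append(norm)
    return cleaned


# Hypothetical raw inputs, e.g. text extracted from PDFs or HTML.
raw = [
    "Refund  Policy: items can be returned within 30 days.",
    "refund policy: items can be returned within 30 days.",
    "Page 3",
    "Shipping takes 5 to 7 business days.",
]
cleaned = clean_corpus(raw)
print(cleaned)
```

Normalizing before deduplicating is a deliberate choice here: the first two sample documents differ only in casing and spacing, so they collapse into one entry. Real pipelines often need fuzzier matching than exact string equality.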
For a complex machine learning project, acquiring and cleaning a dataset of around 100,000 samples can cost upwards of $70,000, with an additional 80 to 160 hours required just for cleaning.
Looking ahead, the evolution of RAG involves optimizing retrieval mechanisms for speed and developing hybrid models that combine fine-tuned LLMs with more intelligent retrieval systems. There is also a significant push towards multimodal RAG, which will allow systems to reason over not just text, but also images, diagrams, and structured data from knowledge graphs.
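Whatever retrieval mechanism is in use, the precision and recall definitions given earlier (the share of retrieved documents that are relevant, and the share of relevant documents that were retrieved) reduce to simple set arithmetic. The sketch below assumes hypothetical document IDs and a hand-labeled set of relevant documents; the function name is illustrative and is not taken from Ragas or any other framework.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for a single query.

    retrieved: document IDs returned by the retriever
    relevant:  document IDs labeled as relevant for the query
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 documents actually relevant.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = {"doc1", "doc3", "doc5"}
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this example two of the four retrieved documents are relevant, and two of the three relevant documents were found. Averaging these values over a labeled set of queries gives a simple baseline for spotting retrieval regressions.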