Synthetic Data Validation Startup 'Solid' Launches with $20M

Solid Data, Inc. has launched with $20 million in seed funding to automate the generation and validation of synthetic data for enterprise AI. The company aims to make AI systems more reliable at scale by addressing data pipeline challenges. The launch comes as competitors like Rapidata.ai also claim to reduce bottlenecks but acknowledge the continued need for human-in-the-loop QA.

- Solid's co-founders, CEO Yoni Leitersdorf and CTO Tal Segalov, are repeat entrepreneurs with experience in building enterprise data platforms. They aim to solve the problem of AI's unreliability stemming from a lack of understanding of business context and inconsistent data definitions.
- The $20 million seed funding round was led by Team8 and SignalFire, and will be used to accelerate product development and expand the team to support a growing customer base.
- While synthetic data can be generated much faster than human labeling, it can be less accurate for tasks requiring contextual nuance. Hybrid approaches that combine synthetic data for scale with human labeling for critical, nuanced tasks often yield the best performance, improving model accuracy and reducing costs.
- Reinforcement Learning from Human Feedback (RLHF) is a key technique for aligning large language models with human values, but it relies on high-quality, consistent feedback from human labelers. The quality of this human feedback is a direct bottleneck for the performance and safety of frontier AI models.
- An alternative approach, Constitutional AI, aims to reduce the reliance on constant human feedback by providing the AI with a set of guiding principles to self-evaluate and refine its outputs. This method, known as Reinforcement Learning from AI Feedback (RLAIF), uses the AI's own critiques based on its "constitution" to train a preference model.
- The evaluation of agentic AI systems requires a shift from measuring single-model outputs to assessing the emergent behaviors of the entire system. This includes benchmarking tool selection accuracy, multi-step reasoning, and task completion success rates across various environments using frameworks like AgentBench and WebArena.
- The data labeling industry is moving away from a low-skill, gig-economy model towards a demand for high-context, domain-specific expertise from professionals like doctors and lawyers to provide nuanced feedback for training advanced AI systems. This has led to a significant increase in what top AI labs are spending on human-in-the-loop data pipelines.
- A major challenge for enterprise AI adoption is poor data quality and the lack of sufficient proprietary data, with many companies struggling with fragmented and siloed information. This "technical debt" in data infrastructure can significantly increase the cost and complexity of AI projects.
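
To make the agent-evaluation point above concrete, here is a minimal sketch of how two of the named metrics, tool selection accuracy and task completion success rate, might be computed over a batch of agent runs. The `Episode` record and both function names are hypothetical and chosen for illustration; they are not part of AgentBench, WebArena, or any Solid product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    """One agent run (hypothetical record layout, for illustration only)."""
    expected_tools: List[str]  # reference tool for each step
    chosen_tools: List[str]    # tool the agent actually picked at each step
    completed: bool            # whether the overall task succeeded


def tool_selection_accuracy(episodes: List[Episode]) -> float:
    """Fraction of reference steps where the agent picked the expected tool."""
    correct = 0
    total = 0
    for ep in episodes:
        # Every reference step counts toward the total, so steps the
        # agent never reached are scored as misses.
        total += len(ep.expected_tools)
        for expected, chosen in zip(ep.expected_tools, ep.chosen_tools):
            if expected == chosen:
                correct += 1
    return correct / total if total else 0.0


def task_completion_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes in which the task was completed."""
    if not episodes:
        return 0.0
    return sum(1 for ep in episodes if ep.completed) / len(episodes)


if __name__ == "__main__":
    runs = [
        Episode(["search", "click"], ["search", "type"], completed=False),
        Episode(["search"], ["search"], completed=True),
    ]
    print("tool selection accuracy:", tool_selection_accuracy(runs))
    print("task completion rate:", task_completion_rate(runs))
```

Scoring per step rather than per episode is a design choice in this sketch: it separates "picked the wrong tool once" from "failed the whole task", which is the system-level distinction the bullet describes.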
