Side Project 'LLM Stats' Becomes Industry Tool
LLM Stats, a platform for tracking AI benchmarks, evolved from a simple side project into an industry-standard tool used by thousands. The project's growth serves as a case study in how iterating on user research can turn a hobby into a valuable portfolio piece for aspiring product managers (PMs).
LLM Stats was initially developed by Jonathan Chávez and Sebastian Crossa, who met during their first year of college in Mexico. They collaborated on several side projects before creating the leaderboard, which grew to 60,000 monthly active users and more than 300,000 total unique visitors within a few months of its launch.
The platform's core mission is to make the performance and benchmarking of a wide array of AI models transparent. It serves developers and researchers by aggregating official benchmarks and pricing from major providers such as OpenAI, Google, and Anthropic into accessible leaderboards and comparison tools. This focus on practical details, such as pricing and context windows alongside performance scores, helped it stand out from other leaderboards.
The success of LLM Stats, and the lessons learned from building it, led the founders to their next venture, ZeroEval, a Y Combinator-backed company. ZeroEval addresses a more complex problem they identified: the difficulty of evaluating sophisticated, multi-turn AI agents. The tool is designed to help developers build more reliable AI systems by creating evaluations that learn from mistakes over time.
ZeroEval is a clear evolution of the initial concept behind LLM Stats. While LLM Stats focuses on transparently presenting existing benchmarks, ZeroEval is built to create new, more nuanced evaluation methods. It allows developers to create "calibrated LLM judges" that improve as they see more production data and receive feedback on incorrect samples, moving beyond static performance metrics.
Unreliable AI output is a significant concern for users: one survey indicated that 35% of users identify reliability and inaccurate outputs as their primary issue with the technology. ZeroEval tackles this with features like "Autotune," which automatically evaluates models and optimizes prompts based on a small number of human-labeled examples.
This progression from a side project to a YC-backed startup demonstrates a keen understanding of the evolving needs within the AI development community. The founders first addressed the need for a clear, consolidated view of model performance and then moved on to solve the more advanced challenge of ensuring AI reliability through iterative, feedback-driven evaluation.