Expedia Group details accelerated testing method
Expedia Group's technology team published an article detailing its use of "interleaving" to accelerate A/B testing. The method mixes user experiences in real time to identify winning models more quickly than traditional tests. This provides early performance signals while reducing user exposure to underperforming variants.
- The core advantage of interleaving is its increased statistical power, allowing for the detection of true differences between models with less data. By presenting a blended list of results to the same user, it controls for user-to-user variability, which is a major source of noise in traditional A/B tests.
- Companies like Netflix, Airbnb, and DoorDash have adopted interleaving, reporting that it can be 10 to over 100 times more sensitive than traditional A/B testing. This heightened sensitivity enables the detection of subtle model improvements that might otherwise be missed.
- While A/B tests measure absolute metrics like click-through rates, interleaving typically provides relative preference outcomes, such as "Model A wins 60% of the time". This makes it ideal for quickly comparing the performance of ranking models.
- Expedia Group's Test and Learn (EGTnL) platform has evolved to manage thousands of A/B tests annually. To mitigate the risks associated with this volume, they developed a "Circuit Breaker" system for real-time monitoring to automatically suspend underperforming tests within minutes of launch.
- Within the first 24 hours of an experiment, Expedia's real-time monitoring system caught 36% of all experiment-related issues. In one instance, it detected a 39% drop in conversion within minutes, saving thousands of days of testing time by quickly identifying misconfigured experiments.
- There are several established interleaving algorithms, including Team-Draft Interleaving, where results from each model are alternately chosen, and Balanced Interleaving, which aims for equal representation of both models at all ranks.
- Interleaving is most effective when comparing models that produce similar outputs and is primarily designed for head-to-head comparisons. It is not a complete replacement for A/B testing but serves as a complementary technique for rapid iteration.
- Expedia Group leverages over 70 petabytes of travel data to train its hundreds of AI and machine learning models, which make over 900 billion predictions each year to personalize the travel experience.
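
To make the Team-Draft Interleaving idea mentioned above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Expedia's implementation: the function names (`team_draft_interleave`, `score_clicks`), the hotel IDs, and the rule of letting the other model pick when one ranking is exhausted are all hypothetical choices made for this example. The two models take turns contributing their highest-ranked item that has not been placed yet, with a coin flip deciding who goes first in each round, and each click is then credited to the model that contributed the clicked item.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple


def team_draft_interleave(
    ranking_a: Sequence[str],
    ranking_b: Sequence[str],
    rng: random.Random,
    length: Optional[int] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Blend two rankings; return the blended list and an item -> team map."""
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved: List[str] = []
    team_of: Dict[str, str] = {}

    total = len(set(ranking_a) | set(ranking_b))
    limit = total if length is None else min(length, total)

    def next_unplaced(team: str) -> Optional[str]:
        # Highest-ranked item from this team's list that is not yet shown.
        return next((x for x in rankings[team] if x not in team_of), None)

    while len(interleaved) < limit:
        # The team with fewer picks goes next; a coin flip breaks ties.
        if picks["A"] < picks["B"]:
            team = "A"
        elif picks["B"] < picks["A"]:
            team = "B"
        else:
            team = rng.choice(["A", "B"])

        item = next_unplaced(team)
        if item is None:
            # Assumption for this sketch: if one ranking is exhausted,
            # the other team fills the remaining slots.
            team = "B" if team == "A" else "A"
            item = next_unplaced(team)
            if item is None:
                break

        interleaved.append(item)
        team_of[item] = team
        picks[team] += 1

    return interleaved, team_of


def score_clicks(clicked: Sequence[str], team_of: Dict[str, str]) -> str:
    """Credit each click to the contributing team; return 'A', 'B', or 'tie'."""
    wins = {"A": 0, "B": 0}
    for item in clicked:
        if item in team_of:
            wins[team_of[item]] += 1
    if wins["A"] > wins["B"]:
        return "A"
    if wins["B"] > wins["A"]:
        return "B"
    return "tie"


# Hypothetical example: two rankers order the same pool of hotels differently.
model_a = ["h1", "h2", "h3", "h4"]
model_b = ["h3", "h1", "h5", "h2"]
blended, teams = team_draft_interleave(model_a, model_b, random.Random(0))
winner = score_clicks(["h5"], teams)  # the user clicked only hotel h5
```

Aggregating the per-session winner over many sessions gives the kind of relative preference outcome described above (for example, the share of sessions in which one model wins), rather than an absolute metric such as click-through rate.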