ConvApparel dataset released

Google Research published ConvApparel, a dataset designed to close the 'realism gap' for LLM user simulators in conversational agents, a useful step for fashion tech and retail personalization research. The dataset is pitched as making simulated agent interactions more realistic, which can help teams train and test conversational features in product experiments (x.com).

Most shopping chatbots are tested against fake customers who are too patient, too clear, and too easy to please. Google Research says that gap between simulated users and real people is big enough to make a system look good in testing and then stumble in an actual conversation. (research.google)

A user simulator is a stand-in customer: a model that talks to another model so researchers can run thousands of practice conversations without paying thousands of people. It is the conversational version of a flight simulator, except the “pilot” is a shopping assistant and the “weather” is human behavior. (research.google)

The problem is that many of these stand-in customers act like idealized test subjects. The ConvApparel paper says simulated users often show “unnatural patience” and “encyclopedic knowledge,” which means they tolerate weak recommendations and describe what they want more clearly than real shoppers usually do. (arxiv.org)

ConvApparel tries to fix that by collecting human-artificial intelligence shopping conversations in apparel, where preferences are messy and subjective. A shopper looking for “something cute for a summer wedding” is harder to model than someone asking for a battery with a specific voltage. (research.google; arxiv.org)

The dataset was built with a two-recommender setup instead of a single chatbot. One recommender was designed to be good and one was designed to be bad, so the researchers could capture both satisfying and frustrating conversations from the same kinds of shopping goals. (arxiv.org; research.google) That design gave the researchers something rare in conversational data: a built-in before-and-after comparison for user reactions. If the same type of shopper behaves differently with a helpful assistant than with a weak one, a simulator can be checked on whether it changes in the same direction. (arxiv.org)

The paper also adds first-person annotations of user satisfaction, which means the people in the conversations explicitly recorded how the interaction felt from their side. That gives the dataset more than chat transcripts; it gives labels for whether the recommendation experience actually worked. (arxiv.org)

Google pairs the dataset with a validation framework instead of treating realism as a vibe. The framework combines statistical alignment, a learned “human-likeness” score, and counterfactual validation, which is a test of whether the simulator reacts differently when the assistant gets better or worse. (research.google)

In the paper, the strongest simulator used reinforcement learning with iterative critique. That is a training loop where the model gets feedback on each conversation, adjusts, and tries again, like a sales trainee reviewing call recordings after every shift. (research.google)

The release landed on April 9, 2026 in a Google Research blog post, and the work is also listed for the 2026 Conference of the European Chapter of the Association for Computational Linguistics in Rabat. That puts ConvApparel in the benchmark category, not the consumer product category: it is infrastructure for testing shopping assistants, not a shopping app itself. (research.google)

For fashion and retail teams, the practical use is simple: train and test agents against users who hesitate, change their minds, and react differently to bad advice. A conversational recommender that survives that kind of practice is more likely to hold up when the customer is a real person with vague taste, limited patience, and no interest in helping the model succeed. (research.google; arxiv.org)
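The counterfactual validation idea described above, checking whether a simulator reacts to a better or worse assistant in the same direction real shoppers do, can be sketched in a few lines of Python. This is only an illustration, not the paper's actual metric: the function names, the 1-to-5 satisfaction scale, and every rating below are made-up assumptions chosen to show the shape of the comparison.

```python
from statistics import mean


def satisfaction_shift(good_scores, bad_scores):
    """Mean satisfaction with the helpful assistant minus mean with the weak one."""
    return mean(good_scores) - mean(bad_scores)


def reacts_like_humans(human_good, human_bad, sim_good, sim_bad):
    """Compare how humans and a simulator respond to assistant quality.

    Returns (same_direction, human_shift, sim_shift). same_direction is True
    only when both shifts are nonzero and share a sign.
    """
    human_shift = satisfaction_shift(human_good, human_bad)
    sim_shift = satisfaction_shift(sim_good, sim_bad)
    same_direction = human_shift * sim_shift > 0
    return same_direction, human_shift, sim_shift


# Hypothetical 1-5 satisfaction ratings, invented for this sketch.
human_good = [4, 5, 4, 5]   # real shoppers, helpful assistant
human_bad = [2, 1, 2, 3]    # real shoppers, weak assistant

# An overly patient simulator: it stays happy even with the weak assistant.
patient_sim_good = [5, 5, 4, 5]
patient_sim_bad = [5, 5, 5, 5]

# A more reactive simulator: its ratings drop when the assistant is weak.
reactive_sim_good = [5, 4, 4, 5]
reactive_sim_bad = [2, 2, 3, 1]

print(reacts_like_humans(human_good, human_bad, patient_sim_good, patient_sim_bad))
print(reacts_like_humans(human_good, human_bad, reactive_sim_good, reactive_sim_bad))
```

In this toy data the patient simulator fails the check because its satisfaction does not fall when the assistant gets worse, while the reactive one passes. A real evaluation would also need to compare the size of the shift and account for sampling noise, which this sketch ignores.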
