Embedding Training Methods Criticized for Lacking Hierarchy
A recent analysis argues that many modern embedding techniques fail to encode hierarchical structures, leading to suboptimal performance in downstream tasks like search and recommendation. The author advocates for training methodologies that explicitly model these relationships to improve the semantic quality of learned representations. This approach aims to create more nuanced and accurate content and user embeddings.
- Traditional flat embedding models like Word2Vec represent all words in a single vector space, which struggles to distinguish between words with multiple meanings (polysemy) and fails to capture hierarchical relationships. For example, the word "bank" has the same embedding in "river bank" and "investment bank," and the model doesn't inherently know that a "poodle" is a type of "dog," which is a type of "mammal." - To better represent hierarchical data, researchers at Facebook AI Research developed Poincaré embeddings, which learn representations in hyperbolic space instead of the traditional Euclidean space. This geometric approach allows the model to learn hierarchy and similarity simultaneously, making it more efficient at representing tree-like structures like taxonomies or organizational charts. - A recent paper from Google Research introduced a "pretrain-finetune" recipe for hierarchical retrieval that improved the recall of distant ancestor categories from 19% to 76%. The method uses an asymmetric dual-encoder, meaning a concept is embedded differently depending on whether it's on the query side or the document side to resolve geometric tensions. - In recommendation systems, hierarchical embeddings can model user preferences at different levels of granularity. For instance, if a user interacts with "wireless headphones," the system can update its understanding of their preference for the parent categories "audio" and "electronics," leading to more diverse and relevant suggestions. - Transformer models like BERT, which have a 512-token input limit, are being adapted with hierarchical structures to process long documents for tasks like classification and summarization. These "Hierarchical BERT" models first encode smaller segments of the text and then use another layer to aggregate these segment-level representations. - Graph Neural Networks (GNNs) are another common approach for learning from structured data, where they propagate information through layers to capture relationships. For group recommendations, a model called HyperGroup uses a GNN to learn individual preferences from their friends' interactions before modeling the group's preferences. - Netflix is moving towards a centralized foundation model for personalization to learn from comprehensive user histories at a large scale, generating embeddings for members, videos, and genres. These embeddings are then used by various downstream models for tasks like candidate generation and title-to-title recommendations.