LinkedIn's 'Venice' Offers Blueprint for Planetary-Scale Data

A deep dive into LinkedIn's "Venice" data store provides a case study in building planetary-scale systems. The architecture relies on geographically distributed sharding and tunable consistency models to deliver high-throughput, low-latency data access globally.

Venice originated as the successor to "Voldemort," an earlier key-value store at LinkedIn, with development ideation starting around 2014. A key figure in its development is Felix GV, who has been involved in all phases of the project from its early architecture to its open-sourcing. The system was designed specifically to handle "derived data"—data computed from other sources—for AI applications at LinkedIn. At its core, Venice is architected for extremely high-throughput asynchronous writes, a direct contrast to databases that prioritize strongly consistent online writes like MySQL or Apache HBase. It achieves this by funneling all data ingestion, whether from batch sources like Hadoop or streaming sources like Apache Samza, through Kafka. This design choice means it operates on an eventual consistency model. This eventual consistency model differs from the tunable consistency offered by systems like Apache Cassandra. While Cassandra allows clients to specify the number of replicas that must acknowledge an operation to achieve different levels of consistency, Venice is fundamentally designed for offline and nearline data ingestion, making it highly available for reads but not suitable for read-your-writes scenarios. This makes it ideal for use cases like powering recommendation engines and news feed ranking where slight data staleness is acceptable. Venice's role as a derived data store also distinguishes it from in-memory data stores like Redis. While Redis excels at use cases requiring extremely low latency for volatile data, such as caching, session management, and real-time leaderboards, Venice is optimized for storing and serving large, pre-computed datasets. For instance, it can handle the batch ingestion of terabytes of data daily, a task for which Redis is not primarily designed. Inside LinkedIn, Venice is a critical component of the AI infrastructure, serving as the storage layer for machine learning features. It powers over 1800 datasets and is used by more than 300 applications, handling a peak of 167 million key lookups per second. The system ingests an average of 14 GB of data per second, which translates to roughly 39 million rows per second. A key architectural feature is its "hybrid store" capability, which allows for the seamless combination of batch and streaming data. Datasets in Venice are versioned; a new batch push creates a new, immutable version of the data in the background. Real-time updates from streaming sources can then be applied on top of this new version before it is atomically swapped to become the primary serving version. To optimize read performance, Venice offers a "read compute" feature that allows computations like dot products and cosine similarity to be performed on the server-side, reducing the amount of data that needs to be transferred over the network. Additionally, the "Da Vinci" client provides a stateful local cache, which can serve reads with zero network hops for applications with the most stringent latency requirements. LinkedIn open-sourced Venice in September 2022 and has been actively developing it since, with over 800 new commits in the 18 months following the announcement. The project maintains a public Slack channel for community engagement. While specific external adoption metrics are not widely publicized, the open-sourcing effort aims to foster a community around the project and encourage its use in other large-scale data systems.

LinkedIn's 'Venice' Offers Blueprint for Planetary-Scale Data

Get your own daily briefing