Case Study: Zomato's Data Pipeline

A new breakdown details food delivery giant Zomato's massive data pipeline. The system handles 450 million Kafka messages per minute using a stack of Spark, Flink, Airflow, and Trino on EMR with Iceberg tables, offering a blueprint for enterprise-scale data processing.
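
For a sense of scale, the per-minute Kafka figure above can be converted into other units. This is a rough sketch that assumes the 450 million messages per minute is a sustained average; the article does not say whether it is an average or a peak, so the daily total in particular should be read as illustrative.

```python
# Back-of-the-envelope conversion of the Kafka figure quoted in the article.
# Assumption: the quoted rate is sustained around the clock, which the
# article does not confirm.
SECONDS_PER_MINUTE = 60
MINUTES_PER_DAY = 24 * 60

kafka_messages_per_minute = 450_000_000

messages_per_second = kafka_messages_per_minute / SECONDS_PER_MINUTE
messages_per_day = kafka_messages_per_minute * MINUTES_PER_DAY

print(f"{messages_per_second:,.0f} messages per second")
print(f"{messages_per_day:,} messages per day if the rate were constant")
```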

Zomato's data architecture is engineered for immense scale, processing not only order and user data but also real-time feedback for its advertising ecosystem. Legacy pipelines carried very large state, in some cases exceeding 150 GB, which slowed recovery and risked data loss. To address this, Zomato implemented a reconciliation mechanism and shifted to Flink SQL, which significantly reduced state size and infrastructure costs.

The system supports over 50 million monthly orders from 18 million active customers, connecting them with approximately 230,000 restaurant partners and a fleet of 350,000 delivery partners. The company, along with its quick-commerce service Blinkit, manages over 1,000 data marts that are refreshed every 15 to 60 minutes to provide fast, scalable analytics at a low cost.

At the core of this infrastructure is Apache Kafka, which serves as the real-time streaming backbone for asynchronous processing across Zomato's microservices. When a user places an order, the event is published to Kafka and consumed independently by the services responsible for order management, payment processing, and restaurant notification. This decoupled approach, a hallmark of the Kappa architecture Zomato employs, means that processes like order tracking and notifications are handled in near real-time.

For data storage and querying, Zomato runs a data lakehouse on AWS with Apache Iceberg as the open table format. Flink consumes data from Kafka, enriches it, and writes it to Iceberg tables on Amazon S3 in ORC format. Trino, a distributed SQL query engine, is then used for fast, interactive analytics directly on the data lake, a strategy also employed by companies like Quora and Netflix. This separation of compute and storage allows efficient querying of massive datasets without data duplication.

To optimize performance and manage costs, Zomato migrated its Trino and Apache Druid workloads to AWS Graviton-based instances. The move resulted in a 30% reduction in Amazon EC2 usage costs and improved query performance by as much as 25%. The migration also allowed Zomato to right-size its compute footprint, reducing the peak capacity of its Druid and Trino clusters by 25% and 20%, respectively.

The Iceberg table format gives Zomato ACID transactions, schema evolution without rewriting tables, and time-travel capabilities for reproducible queries. These features are crucial for maintaining data integrity and make auditing and debugging easier. The largest upsert table in the system handles around 10 million daily upserts on a 5 TB table, while the largest append table is approximately 50 TB with 2 billion daily inserts.

Zomato's data science applications are extensive, powering everything from personalized food recommendations and delivery route optimization to fraud detection. They use a variety of machine learning models and algorithms, such as Dijkstra's algorithm for route planning and collaborative filtering for recommendations, to improve user experience and operational efficiency. This data-driven approach is fundamental to the company's ability to make real-time decisions in a highly dynamic market.

The architecture is designed for high availability and disaster recovery. Databases are sharded by geographic location or user ID, and master-slave replication is used to handle read and write operations efficiently. Regular database snapshots are stored in Amazon S3 so that data can be restored after a failure, and failover mechanisms switch traffic to a secondary data center with minimal downtime.
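
The decoupled order flow described above, where one published event is read independently by several services, can be sketched with a toy in-memory log. This is not Kafka client code and not Zomato's implementation; the `InMemoryTopic` class, the group names, and the event fields are all hypothetical and exist only to show how per-consumer offsets let each service read the same events at its own pace.

```python
from collections import defaultdict
from typing import Any


class InMemoryTopic:
    """Toy stand-in for a topic: an append-only log plus one read offset per consumer group."""

    def __init__(self) -> None:
        self._log: list[dict[str, Any]] = []
        self._offsets: dict[str, int] = defaultdict(int)

    def publish(self, event: dict[str, Any]) -> int:
        """Append an event and return its position in the log."""
        self._log.append(event)
        return len(self._log) - 1

    def poll(self, group: str) -> list[dict[str, Any]]:
        """Return the events this group has not read yet and advance its offset."""
        start = self._offsets[group]
        batch = self._log[start:]
        self._offsets[group] = len(self._log)
        return batch


# Hypothetical order events; the field names are illustrative only.
orders = InMemoryTopic()
orders.publish({"order_id": 1, "restaurant_id": 42})
orders.publish({"order_id": 2, "restaurant_id": 7})

# Two services read the same events without affecting each other.
payments_seen = orders.poll("payments")
notifications_seen = orders.poll("restaurant-notifications")

# A later event is delivered only once to a group that is already caught up.
orders.publish({"order_id": 3, "restaurant_id": 42})
payments_next = orders.poll("payments")
```

The producer never needs to know which services exist: adding a new consumer only means reading the log under a new group name.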
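
The upsert and time-travel behaviour attributed to Iceberg above can be illustrated with a small in-memory model. This is a conceptual sketch only: it does not use the Iceberg API, and real Iceberg tables track snapshots through metadata and data files rather than copying the whole table on each commit. The `SnapshotTable` class and the `order_id` and `status` columns are hypothetical.

```python
from typing import Any, Optional

Row = dict[str, Any]


class SnapshotTable:
    """Toy table that keeps every committed version, keyed by a primary key column."""

    def __init__(self, key: str) -> None:
        self._key = key
        # Snapshot 0 is the empty table.
        self._snapshots: list[dict[Any, Row]] = [{}]

    def upsert(self, rows: list[Row]) -> int:
        """Commit rows as a new snapshot: update matching keys, insert the rest."""
        current = dict(self._snapshots[-1])  # copy so older snapshots stay unchanged
        for row in rows:
            current[row[self._key]] = row
        self._snapshots.append(current)
        return len(self._snapshots) - 1

    def scan(self, snapshot_id: Optional[int] = None) -> list[Row]:
        """Read the latest snapshot, or an older one when snapshot_id is given."""
        index = len(self._snapshots) - 1 if snapshot_id is None else snapshot_id
        return sorted(self._snapshots[index].values(), key=lambda r: r[self._key])


table = SnapshotTable(key="order_id")
s1 = table.upsert([
    {"order_id": 1, "status": "placed"},
    {"order_id": 2, "status": "placed"},
])
s2 = table.upsert([
    {"order_id": 2, "status": "delivered"},  # updates an existing row
    {"order_id": 3, "status": "placed"},     # inserts a new row
])

latest = table.scan()          # three rows, order 2 is "delivered"
as_of_first = table.scan(s1)   # two rows, order 2 is still "placed"
```

Reading `as_of_first` after the second commit returns the same rows it would have returned before that commit, which is the property that makes queries reproducible for auditing and debugging.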
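
Dijkstra's algorithm, named above for route planning, finds the cheapest path through a graph with non-negative edge weights. The sketch below is a textbook version using a binary heap. The road graph, node names, and travel times are made up for illustration, and the article gives no detail on how Zomato actually models its road network or applies the algorithm.

```python
import heapq

# Adjacency map: node -> {neighbor: travel time in minutes}. Edges are directed.
Graph = dict[str, dict[str, float]]


def shortest_path(graph: Graph, source: str, target: str) -> tuple[float, list[str]]:
    """Dijkstra's algorithm for non-negative weights; returns (cost, path)."""
    best = {source: 0.0}
    previous: dict[str, str] = {}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break  # the first time the target is popped, its cost is final
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = cost + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    if target not in best:
        return float("inf"), []  # target is unreachable from source
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return best[target], path


# Hypothetical road segments between a restaurant and a customer.
roads: Graph = {
    "restaurant": {"A": 4, "B": 2},
    "A": {"customer": 3},
    "B": {"A": 1, "customer": 7},
}

cost, path = shortest_path(roads, "restaurant", "customer")
print(cost, path)
```

In this example the route through `B` and then `A` is cheaper than either direct option, which is the kind of detour a greedy nearest-neighbor choice would miss.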

