Hands-on Kafka/Flink tutorial

A recent practical YouTube walkthrough shows how to build a real-time pipeline using Kafka, Flink, and Postgres, which is useful for ingesting and processing large Automatic Identification System (AIS) and sensor streams (x.com) (youtube.com). The tutorial is pitched toward real-world ingestion patterns rather than toy examples, so it’s worth comparing its partitioning and checkpoint choices to your fleet’s telemetry profile (x.com).

A real-time pipeline is the software version of an airport baggage belt: data keeps arriving, and each piece has to be routed before the next one piles up. A January 30, 2024 YouTube walkthrough builds that belt with Apache Kafka for intake, Apache Flink for processing, and PostgreSQL for storage, and it demos an end-to-end flow with 100,000 records. (youtube.com)

Apache Kafka is the inbox in that setup. Kafka stores events in ordered slices called partitions, and each partition can be read by exactly one consumer inside a consumer group at a time, which is how teams spread load across machines without losing per-partition order. (docs.confluent.io)

Apache Flink is the moving workbench in the middle. Flink keeps state while records stream past, and its checkpoint system periodically saves both operator state and stream positions so a crashed job can restart as if the failure never happened. (nightlies.apache.org)

PostgreSQL is the shelf at the end of the belt. The tutorial uses it as a sink, which is practical for teams that want a familiar relational database behind dashboards, alerts, or downstream application queries instead of another specialized analytics store. (youtube.com) (postgresql.org)

This gets more concrete with ship tracking and industrial telemetry. The United States Coast Guard says Automatic Identification System vessel transponders can update as often as every two seconds and handle well over 4,500 reports per minute, which is exactly the kind of steady, never-finished stream that breaks batch scripts. (navcen.uscg.gov)

The National Oceanic and Atmospheric Administration says Automatic Identification System data carries location and vessel characteristics in real time for large vessels in U.S. waters. That means a single fleet feed can mix position, speed, heading, and vessel identity, so your pipeline has to preserve enough order to reconstruct what happened to one ship without forcing every ship onto one machine. (coastalscience.noaa.gov)

That is why partition choice is not a minor setting. If you partition by vessel identifier, one ship’s events stay in order inside one Kafka partition, but a few very chatty ships can create hot partitions; if you partition randomly, throughput spreads out, but per-vessel calculations like speed jumps or missing-position detection get harder. (docs.confluent.io)

Checkpoint choice has the same tradeoff shape. Flink’s own docs say JobManagerCheckpointStorage is encouraged for local development and very small state, while FileSystemCheckpointStorage is encouraged for high-availability setups, so a demo that works on a laptop may need a different recovery plan when a fleet feed runs all day and keeps large keyed state. (nightlies.apache.org)

The tutorial’s useful part is not the Docker Compose file or the fact that it reaches PostgreSQL. The useful part is that it shows the seams where real systems usually fail first: how records are keyed, how consumers are grouped, where state is saved, and what happens when the stream is faster on Monday morning than it was in the sample run. (youtube.com) (docs.confluent.io) (nightlies.apache.org)

If your telemetry looks like sparse temperature pings, this pattern is probably overbuilt. If your telemetry looks like Automatic Identification System tracks, engine messages, or sensor bursts that never stop, this Kafka-Flink-PostgreSQL stack is a good test bench, and the first thing to compare is whether its partition key and checkpoint storage match your actual event rate, skew, and recovery target. (youtube.com) (navcen.uscg.gov) (nightlies.apache.org)
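
The partition-key tradeoff described above can be made concrete with a small simulation. This is a minimal sketch, not the tutorial's code: the vessel names, the four-partition topic, and the event counts are hypothetical, and CRC32 is used only as a stable stand-in hash (a real Kafka producer applies its own partitioner to the record key). It compares keying by vessel identifier with key-less round-robin assignment.

```python
import zlib
from collections import Counter
from itertools import cycle

NUM_PARTITIONS = 4  # hypothetical topic size, chosen for illustration


def keyed(vessel_id):
    """Keyed assignment: the same vessel always maps to the same partition.

    CRC32 is a stand-in hash here, not Kafka's actual partitioner.
    """
    return zlib.crc32(vessel_id.encode("utf-8")) % NUM_PARTITIONS


def make_round_robin():
    """Key-less assignment: spread records evenly, ignoring vessel identity."""
    counter = cycle(range(NUM_PARTITIONS))
    return lambda _vessel_id: next(counter)


def simulate(events, assign):
    """Return (records per partition, partitions touched by each vessel)."""
    load = Counter()
    touched = {}
    for vessel_id, _seq in events:
        partition = assign(vessel_id)
        load[partition] += 1
        touched.setdefault(vessel_id, set()).add(partition)
    return load, touched


# Hypothetical feed: one chatty vessel and three quiet ones.
events = [("vessel-A", i) for i in range(90)]
for name in ("vessel-B", "vessel-C", "vessel-D"):
    events += [(name, i) for i in range(10)]

keyed_load, keyed_touched = simulate(events, keyed)
rr_load, rr_touched = simulate(events, make_round_robin())

print("keyed load:      ", dict(keyed_load))
print("round-robin load:", dict(rr_load))
print("partitions used by vessel-A (keyed):      ", len(keyed_touched["vessel-A"]))
print("partitions used by vessel-A (round-robin):", len(rr_touched["vessel-A"]))
```

In the keyed run, every vessel stays on a single partition, so its events keep their order, but the partition holding the chatty vessel carries at least 90 of the 120 records. In the round-robin run, load is even across partitions, but the chatty vessel's track is scattered over all four, so any per-vessel calculation has to reassemble order downstream. Running the same comparison against a sample of your own feed is one rough way to estimate skew before choosing a key.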
