Exactly‑once is end‑to‑end
Practitioners note that exactly‑once guarantees for Flink plus Kafka are only meaningful when the whole pipeline (source offsets, state checkpoints, and transactional or idempotent sinks) is designed for them; using those frameworks is not enough. Common operational advice includes using event time, tuning checkpoint intervals, and ensuring sinks are replay‑safe for AIS and sensor streams. (x.com)
A streaming pipeline is like an assembly line for events, and “exactly once” only holds if every handoff can recover without duplicating or dropping work. Apache Flink says its end-to-end guarantee extends to external systems only when those systems can commit or roll back writes in step with Flink checkpoints. (flink.apache.org)

Apache Flink is a stream-processing engine, and Apache Kafka is an event log that stores records in order within each partition for producers and consumers. Flink’s checkpoint system saves operator state and the corresponding stream positions so a failed job can resume from a consistent point. (flink.apache.org; nightlies.apache.org)

Kafka’s own guarantee is narrower than many teams assume. Apache Kafka says the producer’s idempotent mode, available since version 0.11, prevents duplicate writes from retries, while transactional mode coordinates writes and consumed offsets for consume-process-produce flows. (kafka.apache.org; docs.confluent.io)

That means a Flink-plus-Kafka stack is not automatically “exactly once” from source to destination. Flink’s 2018 end-to-end semantics post says the sink must support coordinated commit or rollback with checkpoints, usually through a two-phase commit pattern. (flink.apache.org)

The sink is where many pipelines fail the promise. If the destination is a database, object store, alerting system, or application programming interface that cannot deduplicate or transact, a restarted job can still write the same result twice even when Flink state recovers cleanly. (flink.apache.org; docs.confluent.io)

Time handling is another fault line. Flink distinguishes event time, which uses the timestamp carried by the record, from processing time, which uses the machine clock at execution. Event time is the mode built for out-of-order streams such as vessel Automatic Identification System (AIS) feeds and industrial sensor data. (nightlies.apache.org; flink.apache.org)

Checkpoint tuning is operational, not cosmetic.
Flink’s operations docs say checkpoints can be asynchronous and incremental, and its large-state tuning guide says production jobs need checkpoint settings that complete reliably and let the application catch up after a failure. (flink.apache.org; nightlies.apache.org)

Backpressure can turn that tuning into a correctness issue. Flink’s checkpointing docs and its unaligned-checkpoint explainer say exactly-once checkpoints normally require barrier alignment, and heavy backpressure can delay those checkpoints enough to stretch recovery windows or stall commits to transactional sinks. (nightlies.apache.org; flink.apache.org)

Flink has also changed how it handles jobs with finished tasks, a detail that matters in mixed bounded-and-unbounded pipelines. The project said in its July 11, 2022 FLIP-147 post that, without checkpoints after some tasks had finished, two-phase-commit sinks could be left unable to commit their last records; support for checkpoints after tasks finish closes that gap. (flink.apache.org)

So the practical rule is narrower and stricter than the slogan. Exactly once is a property of the whole path (source offsets, Flink state, checkpoint timing, and a replay-safe sink), not a badge you get for using Apache Flink and Apache Kafka together. (flink.apache.org; kafka.apache.org)
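
The role of the sink can be illustrated with a toy simulation. What follows is a minimal sketch in plain Python, not Flink or Kafka code: the names `AppendSink`, `UpsertSink`, `TransactionalSink`, and `run_with_crash` are hypothetical, a checkpoint is modeled as nothing more than a saved source offset, and a crash is modeled as rewinding to that offset. Under those assumptions it shows why the same replay produces a duplicate in a plain append sink but not in an idempotent or transactional one.

```python
from dataclasses import dataclass, field


@dataclass
class AppendSink:
    """Not replay-safe: every write is appended, so replayed events duplicate."""
    rows: list = field(default_factory=list)

    def write(self, offset, value):
        self.rows.append((offset, value))

    def commit(self):  # nothing to coordinate with checkpoints
        pass

    def abort(self):  # writes already made cannot be undone
        pass


@dataclass
class UpsertSink:
    """Replay-safe by idempotence: writes are keyed by source offset."""
    rows: dict = field(default_factory=dict)

    def write(self, offset, value):
        self.rows[offset] = value  # a replay overwrites the same key

    def commit(self):
        pass

    def abort(self):
        pass


@dataclass
class TransactionalSink:
    """Replay-safe by transactions: staged writes become visible only on commit."""
    committed: list = field(default_factory=list)
    pending: list = field(default_factory=list)

    def write(self, offset, value):
        self.pending.append((offset, value))

    def commit(self):  # called when a checkpoint completes
        self.committed.extend(self.pending)
        self.pending.clear()

    def abort(self):  # called on failure: drop uncommitted writes
        self.pending.clear()


def run_with_crash(sink, events, checkpoint_every, crash_at):
    """Process events, fail once at offset `crash_at`, resume from the last checkpoint."""
    checkpoint = 0  # last checkpointed source offset
    offset = 0
    crashed = False
    while offset < len(events):
        if not crashed and offset == crash_at:
            crashed = True
            sink.abort()
            offset = checkpoint  # the source rewinds and replays
            continue
        sink.write(offset, events[offset])
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = offset
            sink.commit()
    sink.commit()  # final checkpoint so the last records are committed
    return sink


events = list("abcdef")
print(len(run_with_crash(AppendSink(), events, 2, 3).rows))            # 7: "c" was written twice
print(len(run_with_crash(UpsertSink(), events, 2, 3).rows))            # 6
print(len(run_with_crash(TransactionalSink(), events, 2, 3).committed))  # 6
```

In all three runs the source and the checkpoint behave identically; only the sink differs. The final `commit()` call stands in for the last checkpoint at the end of input, which is the step a two-phase-commit sink needs in order to publish its final records.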