We want to use Apache Flink for a streaming job that reads from one Kafka topic and writes to another. The infrastructure will be deployed to Kubernetes, and I want to restart the job on every PR merge to the master branch.
So I wonder: does Flink guarantee that resubmitting the job will continue the data stream from the last processed message? This matters because one of the job's most important features is message deduplication over a time window.
What are the common patterns for updating streaming jobs in Apache Flink? Should I just stop the old job and submit the new one?
CodePudding user response:
My suggestion would be to simply try it.
Deploy your app manually and then stop it. Run the kafka-consumer-groups
script to find your consumer group. Then restart/upgrade the app and run the command again for the same group. If the lag goes down (as it should), rather than resetting to the beginning of the topic, then it's working as expected, just as it would for any Kafka consumer.
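For reference, the check might look roughly like this (the bootstrap server address and group name are placeholders; your group id is whatever you configured on the Flink Kafka source):

```sh
# Describe the consumer group before and after the restart.
# CURRENT-OFFSET and LAG tell you whether consumption resumed
# from where it left off or reset to the beginning of the topic.
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --group my-flink-job
```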
"read from one Kafka topic and write to another."
Ideally, Kafka Streams is used for this.
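For comparison, a topic-to-topic copy in Kafka Streams is only a few lines. This is a minimal sketch (the topic names, bootstrap server, and application.id are made up), and it covers only the copy part, not your windowed deduplication:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class CopyTopic {
    public static void main(String[] args) {
        Properties props = new Properties();
        // The application.id also acts as the consumer group id
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "copy-topic-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read one topic, write to another
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        streams.start();
    }
}
```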
CodePudding user response:
Kafka consumer offsets are saved as part of the checkpoint. So as long as your workflow is running in exactly-once mode and your Kafka source is properly configured (e.g. you've set a group id), restarting your job from the last checkpoint or savepoint will guarantee no duplicate records in the destination Kafka topic.
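As a rough sketch of what that setup could look like with the KafkaSource/KafkaSink API (topic names, addresses, group id, checkpoint interval, and the transactional id prefix are all placeholders, and your deduplication logic is elided):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.kafka.clients.consumer.OffsetResetStrategy;

public class KafkaToKafkaJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Exactly-once mode; offsets are stored in the checkpoint/savepoint state
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("input-topic")
                .setGroupId("my-flink-job")  // group id, as mentioned above
                .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST))
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("output-topic")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                // Transactional writes to the destination topic
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("my-flink-job")
                .build();

        DataStream<String> stream = env.fromSource(
                source, WatermarkStrategy.noWatermarks(), "kafka-source");
        // ... windowed deduplication logic would go here ...
        stream.sinkTo(sink);

        env.execute("kafka-to-kafka");
    }
}
```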
If you're stopping/restarting the job as part of your CI process, then you'd want to:
- Stop the job with a savepoint.
- Restart the new version from that savepoint (see the CLI sketch after this list).
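In CLI terms that might look roughly like this (the job id, savepoint directory, savepoint name, and jar path are placeholders):

```sh
# Stop the running job and take a savepoint
flink stop --savepointPath s3://my-bucket/savepoints <jobId>

# Submit the new build, resuming from that savepoint
flink run -s s3://my-bucket/savepoints/savepoint-xxxx -d my-job.jar
```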
You could also set ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH
to true (so that a checkpoint is taken when the job is terminated), enable externalized checkpoints, and then restart from the last checkpoint, but the savepoint approach is easier.
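If you do go the checkpoint route, the configuration might look roughly like this (a sketch assuming Flink 1.15+ APIs; the checkpoint interval is a placeholder, and the cluster also needs a checkpoint directory, e.g. state.checkpoints.dir, configured):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.ExecutionCheckpointingOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedEnv {
    public static StreamExecutionEnvironment create() {
        Configuration conf = new Configuration();
        // Take a final checkpoint when the job's tasks finish (available since Flink 1.14)
        conf.set(ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH, true);

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Retain (externalize) checkpoints after cancellation so a later run can restore from them
        env.getCheckpointConfig().setExternalizedCheckpointCleanup(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        return env;
    }
}
```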
You'd also want some regular process that removes older savepoints, though each savepoint will be very small (essentially just the consumer offsets, for your simple workflow).