As far as I understand, for a Spark streaming application (Structured Streaming or otherwise), Spark provides checkpointing to manage offsets for you: you configure a checkpoint location (usually on HDFS) when writing to your sink, and Spark itself takes care of tracking and recovering the offsets.
But I see a lot of use cases where checkpointing is not preferred, and instead a custom offset-management framework is built to save offsets in HBase, MongoDB, etc. I just want to understand why checkpointing is avoided in favor of a custom framework. Is it because checkpointing leads to the small-files problem on HDFS?
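For reference, this is the checkpointing setup the question describes: a minimal Structured Streaming sketch, where the broker address, topic name, and paths are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("offset-demo").getOrCreate()

// Read from Kafka; broker and topic are placeholder values.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// Spark persists offsets and other progress state under checkpointLocation;
// on restart it resumes from the last committed batch automatically.
df.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/out")
  .option("checkpointLocation", "hdfs:///checkpoints/offset-demo")
  .start()
```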
https://blog.cloudera.com/offset-management-for-apache-kafka-with-apache-spark-streaming/
CodePudding user response:
Small files are just one problem with HDFS. Of the options you listed, ZooKeeper would be the more common recommendation, since you likely already run a ZooKeeper cluster (or several) as part of your Kafka and Hadoop ecosystem.
The reason checkpoints aren't used is that they are tightly coupled to the code's topology. A checkpoint serializes the exact DAG of operations (map, filter, reduce, and so on), so changing that code, or sometimes just upgrading Spark, can make an existing checkpoint unreadable and force you to discard it along with its offsets.
Storing offsets externally survives code changes and keeps consistent ordering, but the delivery semantics depend on when you commit: committing offsets in the same transaction as your output gives effectively exactly-once, while committing after the output gives at-least-once (duplicates on retry).
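The external-store pattern from the Cloudera post can be sketched like this, assuming `stream` is a direct Kafka DStream; `saveOffsets` is a hypothetical helper backed by HBase, MongoDB, ZooKeeper, or similar.

```scala
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // The RDD from a direct Kafka stream carries its own offset ranges.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process rdd and write results to the sink ...

  // Persist offsets only after the output succeeds, so a failure replays
  // the batch instead of losing it (at-least-once semantics).
  saveOffsets(offsetRanges) // hypothetical helper writing to your store
}
```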
You could also just store the offsets in Kafka itself (with `enable.auto.commit` disabled) and commit them manually once each batch's output completes:
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#storing-offsets
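Following the pattern in those docs, a minimal sketch of committing back to Kafka, again assuming `stream` is a direct Kafka DStream:

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... output logic for this batch ...

  // Commit the offsets back to Kafka only after the output is done.
  // commitAsync is not transactional with your output, so this is
  // at-least-once, not exactly-once.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```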