startingOffsets and endingOffsets in pyspark structured-streaming


I'm using pyspark 2.4.5 to consume messages from Kafka. For batch queries, there are two options, startingOffsets and endingOffsets, which can be used to read a specific range of messages from Kafka. I'm confused by the example in the Spark documentation:

the Structured Streaming + Kafka Integration Guide

I don't understand the example configuration given in JSON format:

""" {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} """

My questions are:

  • What does the configuration "0": 23 mean?
  • This configuration is used for consuming multiple topics; if I read only one topic, topicA, what would the configuration look like?

CodePudding user response:

A Kafka topic consists of multiple partitions, and each partition has its own offsets. So a starting offset of {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} means:

Topic    Partition    Offset
topicA   0            23
topicA   1            -1
topicB   0            -2

The meanings of -1 and -2 are specified in the doc: -2 refers to the earliest offset of a partition and -1 to the latest. In case you are not familiar with Kafka partitions, there are many good explanations out there.
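
To make the per-partition structure explicit, here is a small sketch (an illustration, not part of the original answer) that builds the same JSON from a plain Python dict; note that partition numbers are string keys while offsets are integers:

import json

# Starting offsets for two topics:
#   topicA partition 0 -> offset 23
#   topicA partition 1 -> latest offset (-1)
#   topicB partition 0 -> earliest offset (-2)
starting_offsets = {
    "topicA": {"0": 23, "1": -1},
    "topicB": {"0": -2},
}

print(json.dumps(starting_offsets))
# {"topicA": {"0": 23, "1": -1}, "topicB": {"0": -2}}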

For a single topic it would simply be: {"topicA":{"0":23,"1":-1}}
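
Putting it together, a batch query for a single topic could look like the sketch below. The broker address, topic name, and the concrete offsets are placeholders, and the spark-sql-kafka-0-10 package is assumed to be on the classpath. Also note that, per the integration guide, -1 (latest) is not allowed in startingOffsets for batch queries, so this sketch starts partition 1 from the earliest offset (-2) instead:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-batch-read").getOrCreate()

df = (spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
      .option("subscribe", "topicA")
      # partition 0 starts at offset 23, partition 1 at the earliest offset
      .option("startingOffsets", """{"topicA":{"0":23,"1":-2}}""")
      # partition 0 stops before offset 50 (placeholder), partition 1 at the latest offset
      .option("endingOffsets", """{"topicA":{"0":50,"1":-1}}""")
      .load())

# Kafka keys and values are binary, so cast them to strings to inspect the messages
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show(truncate=False)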
