For example, my Spark Structured Streaming application has Kafka as the message source; the relevant configuration details are below.
Kafka setup:
Message source: Kafka
Partitions: 40
Input parameters:
maxOffsetsPerTrigger: 1000
Cluster setup:
Number of workers = 5
Number of cores/worker = 8
Question:
With the above setup, does it read
(1000 * 5 * 8) = 40000 messages every time
or
(1000 * 5) = 5000 messages every time
or
read 1000 messages in total and distribute them across the 5 worker nodes?
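For reference, the source is configured roughly like this (a minimal sketch; the topic name, broker address, and checkpoint path are placeholders, not my real values):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-rate-limit-example")
  .getOrCreate()

// Kafka source: a 40-partition topic, capped at 1000 offsets per trigger in total
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder brokers
  .option("subscribe", "events")                       // placeholder topic (40 partitions)
  .option("maxOffsetsPerTrigger", "1000")
  .load()

// Any sink works; console is just for illustration
val query = df.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/events")  // placeholder path
  .start()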
Answer:
Per the documentation for maxOffsetsPerTrigger:
Rate limit on maximum number of offsets processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume.
So it's the last option in your list: at most 1000 offsets in total are read per trigger, split proportionally across the 40 topic partitions (roughly 25 offsets per partition). With 5 workers * 8 cores = 40 cores, each core processes about one partition, so each executor handles at most about 200 offsets per trigger (25 offsets/core). The actual number can be smaller if not enough data arrived during that trigger period.
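If you want to confirm the actual per-trigger volume on your own job, one rough way (assuming query is the StreamingQuery handle returned by start()) is to look at the last progress report:

// lastProgress is null until the first micro-batch has completed
val progress = query.lastProgress
if (progress != null) {
  // with maxOffsetsPerTrigger = 1000 this should never exceed 1000
  println(s"Rows in last micro-batch: ${progress.numInputRows}")
}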
Also, newer versions of Spark add options such as minOffsetsPerTrigger, which lets a trigger wait for a bigger batch when a single trigger period didn't accumulate enough data to process.
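A rough sketch of how the two limits can be combined (minOffsetsPerTrigger requires a newer Spark release, and I believe it works together with maxTriggerDelay, which caps how long a trigger may wait; topic and broker names are placeholders):

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("maxOffsetsPerTrigger", "1000")  // never read more than 1000 offsets per trigger
  .option("minOffsetsPerTrigger", "500")   // wait until at least 500 offsets are available...
  .option("maxTriggerDelay", "5m")         // ...but fire anyway after 5 minutes
  .load()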