My application has a Kafka input stream for a single topic. It filters and aggregates the data, and then writes to Elasticsearch. What I'm seeing is that while the application is distributed to all of the Spark nodes and processes the data properly, only one node is pulling data, and the rest are idle. Also, I am using an R53 hostname for the Kafka nodes. Should I use a comma-separated list of the Kafka nodes instead? The topic has 20 partitions. I am running Spark 3.2.1 using only Spark Streaming (no DFS).
Answer:
The topic has 20 partitions
Then up to 20 tasks, one per partition, should be able to consume in parallel, provided the executors have that many cores between them.
using an R53 hostname for the Kafka nodes
Any Kafka client, including Spark, needs to communicate with each broker individually. This means each broker's advertised.listeners setting must expose an address that Spark can reach directly, not a single DNS name or load balancer address shared by all brokers. If only one broker's address is resolvable, you will only be able to consume from (or produce to) that one broker.
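One way to test this is to check, from a Spark worker node, that every address the brokers advertise accepts a TCP connection. The following is a minimal sketch using only the Python standard library. The helper names (`parse_address`, `is_reachable`, `check_brokers`) and the hostnames in the commented usage are hypothetical, so substitute the addresses from your brokers' advertised.listeners settings.

```python
import socket


def parse_address(address):
    """Split a 'host:port' string into (host, port)."""
    host, _, port = address.rpartition(":")
    return host, int(port)


def is_reachable(address, timeout=3.0):
    """Return True if a TCP connection to 'host:port' succeeds.

    DNS failures, refused connections, and timeouts all raise
    subclasses of OSError, so each of them is reported as False.
    """
    host, port = parse_address(address)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_brokers(addresses, timeout=3.0):
    """Map each advertised broker address to its reachability."""
    return {address: is_reachable(address, timeout) for address in addresses}


# Example usage (hypothetical addresses, run this on a Spark worker node):
# brokers = ["broker1.example.internal:9092", "broker2.example.internal:9092"]
# for address, ok in check_brokers(brokers).items():
#     print(address, "reachable" if ok else "NOT reachable")
```

A successful connection only shows that the address resolves and the port is open. It does not confirm that the listener speaks the security protocol the client expects.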
Should I use a comma-separated list of the Kafka nodes instead
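It's recommended, but not necessary. A list helps when, for example, the broker at the one address you provided is not responding. The bootstrap address is only used for the initial connection: the bootstrap response returns all of the brokers' advertised.listeners addresses back to the client, chosen according to the listener the client connected to and its associated protocol.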
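As a rough sketch of what the comma-separated list looks like in practice, the helper below builds the options for Spark's Structured Streaming Kafka source. It assumes you read with `spark.readStream.format("kafka")`, which the question does not confirm. The function name `kafka_source_options`, the hostnames, and the topic name are hypothetical. `kafka.bootstrap.servers` and `subscribe` are the source's documented option names.

```python
def kafka_source_options(brokers, topic):
    """Build options for Spark's Kafka source from a list of brokers.

    Listing several brokers means the initial (bootstrap) connection
    can still succeed when one of them is down.
    """
    if not brokers:
        raise ValueError("at least one broker address is required")
    return {
        "kafka.bootstrap.servers": ",".join(brokers),
        "subscribe": topic,
    }


# Example usage (hypothetical hostnames and topic, needs a SparkSession
# created with the spark-sql-kafka-0-10 package on the classpath):
# opts = kafka_source_options(
#     ["broker1.example.internal:9092", "broker2.example.internal:9092"],
#     "events",
# )
# df = spark.readStream.format("kafka").options(**opts).load()
```

The list only affects the first connection. After bootstrapping, the client still connects to whatever addresses the brokers advertise, so every one of those addresses must be reachable from all Spark nodes.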