I am trying to understand how I can send (produce) a large batch of records to a Kafka topic from Spark.
From the docs I can see that Spark tries to reuse the same producer across tasks on the same worker. When sending a lot of records at once, the network will be a bottleneck (as will memory, since Kafka buffers records before sending them). So I am wondering which configuration best improves network usage:
1. Fewer workers with more cores each (so, I suppose, more threads per machine)
2. More workers with fewer cores per worker (so, I suppose, better network IO, since traffic is spread across more machines)
Let's say the options I have for 1 and 2 are as follows (from Databricks):
- 4 workers with 16 cores per worker = 64 cores
- 10 workers with 4 cores per worker = 40 cores
To better utilize network IO, which is the better choice?
My thinking so far (but I am not sure, hence the question): although from a CPU point of view (computation-heavy jobs) option 1 would be better (more concurrency and less shuffle), from a network IO point of view I would rather go with option 2, even though I end up with fewer cores overall.
Appreciate any input on this.
Thank you all.
CodePudding user response:
The best solution is to have more workers to achieve parallelism (scale horizontally). The DataFrame has to be written to Kafka using Structured Streaming with Kafka as the sink, as explained here: https://docs.databricks.com/spark/latest/structured-streaming/kafka.html (if you don't want a persistent stream, you can always use the trigger-once option). Additionally, you can assume that 1 DataFrame partition = 1 CPU core, so you can optimize along that axis as well (though in streaming, Databricks usually handles it automatically).
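A minimal sketch of such a trigger-once write in PySpark; the source path, broker address, topic name, and producer tuning values below are all placeholder assumptions, not taken from the linked docs:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, struct, col

spark = SparkSession.builder.getOrCreate()

# Hypothetical streaming source; swap in whatever you actually read from.
df = spark.readStream.format("delta").load("/path/to/source")

# The Kafka sink expects a binary/string `value` column (and optionally `key`),
# so serialize each row to JSON here.
query = (
    df.select(to_json(struct([col(c) for c in df.columns])).alias("value"))
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")    # placeholder
      .option("topic", "my_topic")                          # placeholder
      .option("checkpointLocation", "/path/to/checkpoint")  # required by streaming sinks
      # kafka.* options are passed through to the underlying producer;
      # these values are illustrative, tune them for your network/memory budget.
      .option("kafka.linger.ms", "50")
      .option("kafka.batch.size", "262144")
      .trigger(once=True)  # process what is available, then stop (no persistent stream)
      .start()
)
query.awaitTermination()
```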
On the Kafka side, I would guess it is good to have a number of topic partitions (and brokers) similar to the number of Spark/Databricks workers.
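Building on the 1 partition = 1 CPU rule of thumb above, a hedged sketch of lining the DataFrame up with the topic (the partition count is a made-up example matching the 10-worker x 4-core option, not a recommendation from Kafka or Databricks):

```python
# Assumption: the topic was created with 40 partitions so that a
# 10-worker x 4-core cluster drives roughly one Kafka partition per core.
TOPIC_PARTITIONS = 40  # hypothetical value, match your actual topic

# Repartition before writing so each Spark task produces to about one partition.
df = df.repartition(TOPIC_PARTITIONS)
```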