Spark and Kafka: how to increase parallelism for a producer sending a large batch of records, improving network usage


I am trying to understand how I can send (produce) a large batch of records to a Kafka topic from Spark.

From the docs I can see that Spark attempts to reuse the same producer across tasks on the same worker. When sending a lot of records at once, the network will be the bottleneck (as well as memory, since Kafka buffers records before they are sent). So I am wondering which configuration best improves network usage:

  1. Fewer workers with more cores per worker (so, I suppose, more threads sharing each producer)
  2. More workers with fewer cores per worker (so, I suppose, better network IO, since it is spread across more machines)

Let's say the options I have for 1 and 2 are as follows (from Databricks):

  1. 4 workers with 16 cores per worker = 64 cores
  2. 10 workers with 4 cores per worker = 40 cores

To better utilize network IO, which is the best choice?

My current thinking, though I am not sure (hence the question): from a CPU point of view (computation-heavy jobs), option 1 would be better (more concurrency and less shuffling), but from a network IO point of view I would rather use option 2, even though I end up with fewer cores overall.

Appreciate any input on this.

Thank you all.

CodePudding user response:

The best solution is to have more workers, achieving parallelism by scaling horizontally. The DataFrame should be written to Kafka using Structured Streaming with Kafka as the sink, as explained here: https://docs.databricks.com/spark/latest/structured-streaming/kafka.html (if you don't want a persistent stream, you can always use the trigger-once option). Additionally, you can assume that 1 DataFrame partition = 1 CPU core, so you can optimize this way as well (although in streaming Databricks usually handles it automatically).
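
A minimal sketch of such a trigger-once write, assuming a Databricks notebook where `spark` is predefined (and Spark 3.1+ for `readStream.table`); the source table, key column, brokers, topic, and checkpoint path are all placeholders:

```python
from pyspark.sql import functions as F

# Hypothetical streaming source; replace "events" with your own table.
records = spark.readStream.table("events")

# The Kafka sink expects a string/binary "value" column (and optionally "key").
kafka_ready = records.select(
    F.col("id").cast("string").alias("key"),  # assumed key column
    F.to_json(F.struct(*[F.col(c) for c in records.columns])).alias("value"),
)

query = (
    kafka_ready.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # placeholder brokers
    .option("topic", "my-topic")                                     # placeholder topic
    .option("checkpointLocation", "/tmp/checkpoints/kafka-sink")     # required for streaming
    .trigger(once=True)  # process what is available once, then stop
    .start()
)
query.awaitTermination()
```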

On the Kafka side, I would guess it is good to have a number of topic partitions (and brokers) similar to the number of Spark/Databricks workers.
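
If you want to line things up explicitly, note that the Kafka sink forwards any option prefixed with `kafka.` straight to the underlying producer, so batching and compression settings (which directly affect network usage) can be tuned there. A hedged continuation of the sketch above; the partition count and producer values are assumptions to benchmark, not recommendations:

```python
# Align DataFrame partitions with the parallelism you have
# (e.g. option 2: 10 workers x 4 cores = 40 parallel producer tasks).
target_parallelism = 10 * 4  # assumed cluster shape

query = (
    kafka_ready.repartition(target_parallelism)  # one writing task per core
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("topic", "my-topic")
    .option("checkpointLocation", "/tmp/checkpoints/kafka-sink")
    # "kafka."-prefixed options go to the Kafka producer itself:
    .option("kafka.linger.ms", "50")          # wait briefly to fill larger batches
    .option("kafka.batch.size", "262144")     # 256 KiB producer batches
    .option("kafka.compression.type", "lz4")  # fewer bytes on the wire
    .trigger(once=True)
    .start()
)
```

Whether larger producer batches or more machines helps more depends on where the bottleneck actually is, so it is worth benchmarking both cluster shapes with the same producer settings.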
