Number of partitions in Spark with DRA enabled


The Spark tuning guide suggests 2-3 tasks per CPU core, so for Spark SQL jobs on a fixed-size cluster I set spark.sql.shuffle.partitions = 2 * spark.executor.cores * spark.executor.instances.

But with DRA (dynamic resource allocation) enabled, executors are allocated and deallocated dynamically. How should I set the number of partitions properly, or should I just leave it at the default (200)?
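For context, my fixed-size setup looks roughly like this (the executor and core counts below are just illustrative values):

    from pyspark.sql import SparkSession

    # Fixed-size cluster: 10 executors with 5 cores each (example values)
    executor_instances = 10
    executor_cores = 5

    spark = (
        SparkSession.builder
        .appName("static-cluster-example")
        .config("spark.executor.instances", executor_instances)
        .config("spark.executor.cores", executor_cores)
        # 2 tasks per core, following the tuning-guide rule of thumb
        .config("spark.sql.shuffle.partitions", 2 * executor_cores * executor_instances)
        .getOrCreate()
    )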

CodePudding user response:

DRA and spark.sql.shuffle.partitions are connected in a way that may not be obvious at first sight. When a shuffle happens, Spark creates a number of tasks equal to spark.sql.shuffle.partitions, and then tries to allocate as many executors as it can to get the best parallelism (so if you have 200 tasks and 5 cores per executor, Spark will request 40 executors, because one core executes one task at a time).
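As a rough sketch of that relationship (not Spark's exact scheduling logic, just the back-of-the-envelope arithmetic):

    import math

    # Rough estimate of how many executors DRA will ask for
    # when every shuffle partition becomes a pending task.
    def requested_executors(shuffle_partitions: int, executor_cores: int) -> int:
        return math.ceil(shuffle_partitions / executor_cores)

    print(requested_executors(200, 5))  # 40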

So if you increase the number of partitions, you also increase the maximum number of executors your job will try to allocate, which makes it hard to pick a constant value that satisfies the condition you mentioned.

To get around that, you can set spark.dynamicAllocation.maxExecutors: with this property set, Spark will never allocate more executors than the value you specify.

If you want at most 2 tasks per CPU core, you can adjust the formula and use something like this:

spark.sql.shuffle.partitions = 2 * spark.executor.cores * spark.dynamicAllocation.maxExecutors
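Put together, a DRA setup following that formula might look like the sketch below (the maxExecutors value is just an example; pick one that fits your cluster, and note that depending on your cluster manager DRA may also need shuffle tracking or an external shuffle service):

    from pyspark.sql import SparkSession

    executor_cores = 5
    max_executors = 20  # example cap, tune for your cluster

    spark = (
        SparkSession.builder
        .appName("dra-example")
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.maxExecutors", max_executors)
        .config("spark.executor.cores", executor_cores)
        # at most 2 tasks per core across the largest possible cluster
        .config("spark.sql.shuffle.partitions", 2 * executor_cores * max_executors)
        .getOrCreate()
    )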

But from my experience, if you want the best performance you should analyse your data and your cluster's capabilities to choose the best option for you. A good rule of thumb is to aim for partitions of around 100 MB-200 MB and spawn many small tasks by setting spark.sql.shuffle.partitions accordingly, but it depends on your use case. You can check some examples here.
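One way to apply the 100-200 MB rule of thumb is to estimate the shuffle write size of the stage (for example from the Spark UI) and derive the partition count from it; the numbers below are purely illustrative:

    import math
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Estimated size of the data being shuffled, e.g. taken from the Spark UI
    shuffle_bytes = 50 * 1024**3            # example: 50 GB
    target_partition_bytes = 128 * 1024**2  # aim for ~128 MB per partition

    shuffle_partitions = math.ceil(shuffle_bytes / target_partition_bytes)
    spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)  # = 400 here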
