Is "spark.sql.shuffle.partitions" configuration affects non sql shuffling?


We don't have a lot of SQL in our Spark jobs (that's a problem, I know, but for now it's a fact). I want to optimize the size and number of our Spark shuffle partitions to improve our Spark usage. Many sources recommend setting spark.sql.shuffle.partitions. But will it have any effect if we barely use Spark SQL?

CodePudding user response:

Indeed, spark.sql.shuffle.partitions has no effect on jobs defined through the RDD API; it only controls the number of partitions produced by shuffles in the DataFrame/Dataset (Spark SQL) API.
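
A minimal sketch of the split (the app name, local master, and toy data are illustrative; adaptive query execution is disabled so the SQL shuffle count is deterministic):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-partitions-demo")           // illustrative app name
  .master("local[4]")
  .config("spark.sql.shuffle.partitions", "10")
  .config("spark.sql.adaptive.enabled", "false") // keep the post-shuffle count fixed
  .getOrCreate()

import spark.implicits._

// DataFrame/Dataset shuffle: honors spark.sql.shuffle.partitions.
val dfCounts = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")
  .groupBy("key")
  .count()
println(dfCounts.rdd.getNumPartitions) // 10

// RDD shuffle: ignores spark.sql.shuffle.partitions; the partition
// count here comes from the upstream RDD / default parallelism
// (typically 4 in this local[4] setup), not from the SQL setting.
val rddCounts = spark.sparkContext
  .parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .reduceByKey(_ + _)
println(rddCounts.getNumPartitions)

spark.stop()
```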

The configuration you are looking for is spark.default.parallelism, which the documentation describes as:

Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
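
As a minimal sketch (again with an illustrative app name and toy data): setting spark.default.parallelism drives the partition count of RDD shuffles, while a per-operation numPartitions argument still overrides the global default:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("default-parallelism-demo") // illustrative app name
  .setMaster("local[4]")
  .set("spark.default.parallelism", "32")

val sc = new SparkContext(conf)

// reduceByKey falls back to spark.default.parallelism when no
// explicit partition count is passed.
val counts = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .reduceByKey(_ + _)
println(counts.getNumPartitions) // 32

// A per-operation override still wins over the global default.
val counts8 = sc.parallelize(Seq(("a", 1), ("b", 2)))
  .reduceByKey(_ + _, 8)
println(counts8.getNumPartitions) // 8

sc.stop()
```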
