We don't have a lot of SQL in our Spark jobs (I know that is a problem, but for now it's a fact).
I want to tune the size and number of our Spark shuffle partitions to make better use of the cluster. Many sources recommend setting spark.sql.shuffle.partitions, but will it have any effect if we barely use Spark SQL?
CodePudding user response:
Indeed, spark.sql.shuffle.partitions has no effect on jobs defined through the RDD API. The configuration you are looking for is spark.default.parallelism, which, according to the documentation, is the:

Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
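
As a minimal sketch of how this plays out in an RDD job (the app name, input path, and the value 200 are placeholders for illustration, not recommendations):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleTuningExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("shuffle-tuning-example")
      // Default partition count for RDD shuffles (reduceByKey, join, ...)
      // when no explicit numPartitions argument is passed.
      .set("spark.default.parallelism", "200")

    val sc = new SparkContext(conf)

    val counts = sc.textFile("hdfs:///data/events")   // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // picks up spark.default.parallelism

    // Should print 200 with the setting above.
    println(counts.getNumPartitions)

    sc.stop()
  }
}
```

The same setting can also be supplied at submit time, e.g. `--conf spark.default.parallelism=200`, which avoids hard-coding it in the application.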