We don't have a lot of SQL in our Spark jobs (I know that is a problem, but for now it's a fact).
I want to tune the size and number of our Spark shuffle partitions to make better use of the cluster. Many sources recommend setting spark.sql.shuffle.partitions, but will it have any effect if we barely use Spark SQL?
CodePudding user response:
Indeed, spark.sql.shuffle.partitions has no effect on jobs defined through the RDD API. The configuration you are looking for is spark.default.parallelism, which, according to the documentation, is the:

Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
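
As a minimal sketch of how this plays out in an RDD job (the app name, input path, and the value 200 are placeholders for illustration, not recommendations):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleTuningExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("shuffle-tuning-example")
      // Default partition count for RDD shuffles (reduceByKey, join, ...)
      // when no explicit numPartitions argument is passed.
      .set("spark.default.parallelism", "200")

    val sc = new SparkContext(conf)

    val counts = sc.textFile("hdfs:///data/events")   // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // picks up spark.default.parallelism

    // Should print 200 with the setting above.
    println(counts.getNumPartitions)

    sc.stop()
  }
}
```

The same setting can also be supplied at submit time, e.g. `--conf spark.default.parallelism=200`, which avoids hard-coding it in the application.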