I am trying to understand spark partitions and in a blog I come across this passage
However, you should understand that you can drastically reduce the parallelism of your data processing — coalesce is often pushed up further in the chain of transformation and can lead to fewer nodes for your processing than you would like. To avoid this, you can pass shuffle = true. This will add a shuffle step, but it also means that the reshuffled partitions will be using full cluster resources if possible.
I understand that coalesce means to take the data on some of the least data containing executors and shuffle them to already existing executors via a hash partitioner. I am not able to understand what the author is trying to say in this para though. Can somebody please explain to me what is being said in this paragraph?
CodePudding user response:
Coalesce
has some not so obvious effects due to Spark Catalyst.E.g.
Let’s say you had a parallelism of 1000, but you only wanted to write 10 files at the end. You might think you could do:
load().map(…).filter(…).coalesce(10).save()
However, Spark’s will effectively push down the coalesce operation to as early a point as possible, so this will execute as:
load().coalesce(10).map(…).filter(…).save()
You can read in detail here an excellent article, that I quote from, that I chanced upon some time ago: https://medium.com/airbnb-engineering/on-spark-hive-and-small-files-an-in-depth-look-at-spark-partitioning-strategies-a9a364f908
In summary: Catalyst treatment of coalesce
can reduce concurrency early in the pipeline. This I think is what is being alluded to, though of course each case is different and JOIN and aggregating are not subject to such effects in general due to 200 default partitioning that applies for such Spark operations.
CodePudding user response:
As what you have said in your question "coalesce means to take the data on some of the least data containing executors and shuffle them to already existing executors via a hash practitioner". This effectively means the following
- The number of partitions have reduced
- The main difference between repartition and coalesce is that in coalesce the movement of the data across the partitions is fewer than in repartition thus reducing the level of shuffle thus being more efficient.
- Adding the property shuffle=true is just to distribute the data evenly across the nodes which is the same as using repartition(). You can use shuffle=true if you feel that your data might get skewed in the nodes after performing coalesce.
Hope this answers your question