In Spark, difference between repartition(1) and coalesce(1)

In our project, we are using repartition(1) to write data into a table. I would like to know why coalesce(1) cannot be used here, since repartition is a costly operation compared to coalesce. I know repartition distributes data evenly across partitions, but when the output is a single part file, why can't we use coalesce(1)? Please help me understand whether any other factors are involved.

CodePudding user response：

You don't describe the rest of your logic, so I can only answer in general terms.

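coalesce uses the existing partitions to minimize shuffling. In the case of coalesce(1) versus repartition(1) the difference may not be a big deal, but as a guiding principle: repartition creates new partitions and therefore does a full shuffle, whereas coalesce merges existing partitions and so minimizes the amount of shuffling. One other factor is worth knowing about: because coalesce(1) introduces no shuffle boundary, the computation feeding the write can itself be collapsed into a single task, while repartition(1) lets the upstream stages run in parallel and only shuffles the result into one partition at the end. That is a common reason to prefer repartition(1) despite the extra shuffle.
In my spare time I chanced upon this excellent article: https://medium.com/airbnb-engineering/on-spark-hive-and-small-files-an-in-depth-look-at-spark-partitioning-strategies-a9a364f908. Look for the quote: "Coalesce sounds useful in some cases, but has some problems."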
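
To make the "merge existing partitions" versus "full shuffle" distinction concrete, here is a rough toy model in plain Python. It is not Spark code: the functions coalesce and repartition below are hypothetical stand-ins that work on lists of lists, and the way partitions are grouped is an assumption for illustration (Spark's real grouping also takes data locality into account).

```python
from typing import List

Partition = List[int]


def coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy model: merge whole existing partitions into n groups, no per-record movement."""
    n = min(n, len(partitions))  # without a shuffle the partition count cannot grow
    merged: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Neighbouring parent partitions are appended to the same target as a unit.
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy model: redistribute every record round-robin, i.e. a full shuffle."""
    out: List[Partition] = [[] for _ in range(n)]
    k = 0
    for part in partitions:
        for record in part:
            out[k % n].append(record)
            k += 1
    return out


# Four input partitions with skewed sizes (6, 1, 1 and 2 records).
skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9, 10]]

coalesced = coalesce(skewed, 2)
reshuffled = repartition(skewed, 2)

print([len(p) for p in coalesced])   # sizes inherit the skew of the inputs
print([len(p) for p in reshuffled])  # sizes are evened out by the shuffle
```

With a target of 1 both functions end up with a single partition holding every record, which is why the output file looks the same either way; the difference lies in how the data gets there.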