In our project, we use repartition(1) to write data into a table. I would like to know why coalesce(1) cannot be used here, since repartition is a costly operation compared to coalesce. I know repartition distributes data evenly across partitions, but when the output is a single part file, why can't we use coalesce(1)? Please help me understand whether any other factors are involved.
CodePudding user response:
You don't state anything else about your logic.
coalesce will use existing partitions to minimize shuffling. In the case of coalesce(1) versus its counterpart repartition(1) the difference may not be a big deal, but as a guiding principle: repartition creates new partitions and hence does a full shuffle, whereas coalesce minimizes the amount of shuffling. One other factor to keep in mind: because coalesce(1) adds no shuffle boundary, the computation feeding it may itself end up running as a single task, while repartition(1) pays for a shuffle but lets the upstream partitions be computed in parallel.
In my spare time I chanced upon this excellent article: https://medium.com/airbnb-engineering/on-spark-hive-and-small-files-an-in-depth-look-at-spark-partitioning-strategies-a9a364f908. Look for the quote: "Coalesce sounds useful in some cases, but has some problems."
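To make the "existing partitions" versus "new partitions" distinction concrete, here is a small toy model in plain Python. It is not Spark code and does not use Spark's API: the functions coalesce and repartition below are hypothetical stand-ins written only for illustration, partitions are modelled as plain lists, and the grouping rules (merge neighbouring partitions, deal records round-robin) are simplifying assumptions rather than Spark's exact behaviour.

```python
from itertools import chain


def coalesce(partitions, n):
    """Toy model: merge whole existing partitions into n groups.

    No partition is ever split, so no record has to be redistributed
    individually. Neighbouring partitions are merged (an assumption
    made for simplicity).
    """
    if n >= len(partitions):
        # Asking for more partitions than exist changes nothing.
        return [list(p) for p in partitions]
    size, extra = divmod(len(partitions), n)
    out, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        out.append(list(chain.from_iterable(partitions[start:end])))
        start = end
    return out


def repartition(partitions, n):
    """Toy model: full shuffle, every record is dealt round-robin
    into one of n brand-new partitions."""
    out = [[] for _ in range(n)]
    for i, record in enumerate(chain.from_iterable(partitions)):
        out[i % n].append(record)
    return out


# One large partition and three tiny ones.
skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]

print([len(p) for p in coalesce(skewed, 2)])     # [7, 2] -> skew survives
print([len(p) for p in repartition(skewed, 2)])  # [5, 4] -> evened out

# With a target of 1 both end up with the same records in one partition.
print(sorted(coalesce(skewed, 1)[0]) == sorted(repartition(skewed, 1)[0]))  # True
```

With a target of 2, the merged partitions keep the original skew while the round-robin version is balanced. With a target of 1, the contents are identical, so for a single output file the choice between the two is about how the work gets executed, not about what ends up in the file.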