Home > OS >  How to guarantee dataset is split over unique partitions
How to guarantee dataset is split over unique partitions

Time:06-20

I have a data type like this : case class Data(col: String, ...), and a Dataset[Data] ds. Some rows have columns filled with value 'a', and other with value 'b', etc.

I want to process separately all data with a 'a', and all data with a 'b'. But I also need to have all the 'a' in the same partition.

Question 1 :

If I do : ds.repartition(col("col")).mapPartition(data => ???)

Is it guaranteed by default that I will have all the 'a' in a single partition, and no 'b' mixed with it in this partition?

I can also do this to force the number of partitions :

val nbDistinct = ds.select("col").distinct.count
ds.repartition(nbDistinct , col("col")).mapPartition(data => ???)

But it adds an action that may be expensive in some cases.

Question 2 : Is there a good way to have this guarantees?

Thanks!

CodePudding user response:

All the 'a' will be in the same partition, but 'a' and 'b' may be mixed.

Even using nbDistinct, it is no enough to guarantee dataset is split over unique partitions, so the code should rather be :

val nbDistinct = ds.select("col").distinct.count
ds.repartition(col("col")).mapPartition{ data => 
   // split mixed values in a single partition with group by :
   data.groupBy(_.col).flatMap { case (col, rows) => ??? }
}

Other option would be to use be groupBy / groupByKey

  • Related