I'm learning about groupBy function on spark,I create a list with 2 partitions,then use groupBy to get every odd and even numbers.I found if I define
val rdd = sc.makeRDD(List(1, 2, 3, 4),2)
val result = rdd.groupBy(_ % 2 )
the result with goes to their own partition. But if I define
val result = rdd.groupBy(_ % 2 ==0)
the result turns to in one partition.could anybody explain why?
CodePudding user response:
It's just the hashing
applied to groupBy
Shuffle.
val rdd = sc.makeRDD(List(1, 2, 3, 4), 5)
With 5 partitions you see 2 partitions used and 3 empty. Just an algorithm.