Sorting values of an array type in RDD using pySpark


I have an RDD containing values like this:

[
   (Key1, ([2,1,4,3,5],5)),
   (Key2, ([6,4,3,5,2],5)),
   (Key3, ([14,12,13,10,15],5)),
]

and I need to sort the value of the array part just like this:

[
   (Key1, ([1,2,3,4,5],5)),
   (Key2, ([2,3,4,5,6],5)),
   (Key3, ([10,12,13,14,15],5)),
]

I found two sorting methods in Spark: sortBy and sortByKey. I tried the sortBy method like this:

myRDD.sortBy(lambda x: x[1][0])

But unfortunately, this sorts the records based on the first element of each array instead of sorting the elements within each array.

Also, sortByKey does not help either, since it only sorts the records by their keys.

How can I achieve the sorted RDD?

CodePudding user response:

Try something like this:

rdd2 = rdd.map(lambda x: (x[0], (sorted(x[1][0]), x[1][1])))

Each record is a 2-tuple (key, (array, count)), so the array to sort is x[1][0]. Sorting it with sorted() and rebuilding the value tuple keeps the key and the count unchanged.
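The per-record logic can be checked without a Spark cluster: below is a minimal sketch that applies the same lambda to a plain Python list, since RDD.map simply applies the function to each element. With Spark you would call `rdd.map(sort_inner)` instead of the list comprehension.

```python
# Records shaped like the question's RDD: (key, (array, count)).
records = [
    ("Key1", ([2, 1, 4, 3, 5], 5)),
    ("Key2", ([6, 4, 3, 5, 2], 5)),
    ("Key3", ([14, 12, 13, 10, 15], 5)),
]

# Sort the array inside each value; key and count stay untouched.
def sort_inner(x):
    key, (arr, count) = x
    return (key, (sorted(arr), count))

result = [sort_inner(r) for r in records]
# result[0] == ("Key1", ([1, 2, 3, 4, 5], 5))
```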