Home > Software engineering >  PySpark Order by Map column Values
PySpark Order by Map column Values

Time:08-02

I have a dataframe which looks like this

ID     Col1                  Col2              Col3                     
1  [a1 ->1, a12 -> 5],[b1 -> 23, b23 -> 4],[c12 -> 12, c24 -> 23]
2  [a2 -> 2,a15 -> 4],[b3 -> 1, b45 -> 3],[c54 -> 23, c23 ->4]
3  [a17 -> 1,a88 -> 2],[b15 ->8, b90 -> 1],[c98 -> 5, c54 ->1]

I want to sort the Col1, Col2 and Col3 based on the Values of Map in Descending order

ID     Col1                  Col2              Col3                     
1  [a12 ->5, a1 -> 1],[b1 -> 23, b23 -> 4],[c24 -> 23, c12 -> 12]
2  [a15 -> 4,a2 -> 2],[b45 -> 3, b3 -> 1],[c54 -> 23, c23 ->4]
3  [a88 -> 2,a17 -> 1],[b15 ->8, b90 -> 1],[c98 -> 5, c54 ->1]

My Approach

df = df.select('id', F.explode('Col1')) \
    .withColumn('rn', F.expr('row_number() over (partition by id order by value desc)')) \
    .filter('rn <= 2') \
    .groupBy('id') \
    .agg(F.map_from_entries(F.collect_list(F.struct('key', 'value'))))
df.show(truncate=False)

This does the job only for 1 column, I want to do it for multiple

CodePudding user response:

We can use map_entries to create an array of structs of key-value pairs. Use transform on the array of structs to update to struct to value-key pairs. This updated array of structs can be sorted in descending using sort_array - It is sorted by the first element of the struct and then second element. Again reverse the structs to get key-value pairs. Use map_from_entries on the final sorted array of structs to get back a map field.

Here's an example

data_sdf. \
    withColumn('map_col_entries_desc_sort', 
               func.sort_array(func.transform(func.map_entries('map_col'), lambda x: func.struct(x.value.alias('value'), x.key.alias('key'))), 
                               asc=False)
               ). \
    withColumn('map_col_new', 
               func.map_from_entries(func.transform('map_col_entries_desc_sort', lambda k: func.struct(k.key.alias('key'), k.value.alias('value'))))
               ). \
    select('map_col', 'map_col_entries_desc_sort', 'map_col_new'). \
    show(truncate=False)

#  ----------------------------------------------------------- ----------------------------------------------------------- ----------------------------------------------------------- 
# |map_col                                                    |map_col_entries_desc_sort                                  |map_col_new                                                |
#  ----------------------------------------------------------- ----------------------------------------------------------- ----------------------------------------------------------- 
# |{a1 -> 1.1, b1 -> 2.1, a12 -> 5.5, b23 -> 2.5, c12 -> 2.98}|[{5.5, a12}, {2.98, c12}, {2.5, b23}, {2.1, b1}, {1.1, a1}]|{a12 -> 5.5, c12 -> 2.98, b23 -> 2.5, b1 -> 2.1, a1 -> 1.1}|
#  ----------------------------------------------------------- ----------------------------------------------------------- ----------------------------------------------------------- 
  • Related