Home > OS >  Sort array of structs
Sort array of structs

Time:08-09

Sorting array of structs on the first struct field is straightforward.

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [([("e", 2, 20), ("f", 2, 10), ("d", 2, 30), ("b", 1, 20), ("c", 1, 10), ("a", 1, 30)],)],
    'col1 array<struct<f1:string,f2:int,f3:int>>')

df.printSchema()
# root
#  |-- col1: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- f1: string (nullable = true)
#  |    |    |-- f2: integer (nullable = true)
#  |    |    |-- f3: integer (nullable = true)

df.show(truncate=0)
#  ------------------------------------------------------------------------ 
# |col1                                                                    |
#  ------------------------------------------------------------------------ 
# |[{e, 2, 20}, {f, 2, 10}, {d, 2, 30}, {b, 1, 20}, {c, 1, 10}, {a, 1, 30}]|
#  ------------------------------------------------------------------------ 

The above data can be sorted like this:

df = df.withColumn("col1", F.sort_array("col1"))
df.show(truncate=0)
#  ------------------------------------------------------------------------ 
# |col1                                                                    |
#  ------------------------------------------------------------------------ 
# |[{a, 1, 30}, {b, 1, 20}, {c, 1, 10}, {d, 2, 30}, {e, 2, 20}, {f, 2, 10}]|
#  ------------------------------------------------------------------------ 

But how to sort based on several fields, with differing orders?

E.g. how to sort by f2 asc, f3 desc?

(In this particular example, the result will be the same as above - abcdef.)

CodePudding user response:

Easy way

If you can achieve BOTH of these, you will have a simpler code:

  • rearrange struct fields so that fields for sorting would be placed at the beginning
  • modify the values in fields for sorting so that the order would be the same for all the sort fields (e.g. only ascending)

If you're lucky to have both conditions satisfied, just do F.sort_array("col1")).

For the rest of us, let's continue.

Tip: when possible, we can create new struct fields at the beginning of struct just in order to use the simple sorting method (there's an example in a few sentences below).

Rearranging fields in structs of array can be done like this:

df = df.withColumn("col1", F.expr("transform(col1, x -> struct(x.f2, x.f3, x.f1))"))

df.show(truncate=0)
#  ------------------------------------------------------------------------ 
# |col1                                                                    |
#  ------------------------------------------------------------------------ 
# |[{2, 20, e}, {2, 10, f}, {2, 30, d}, {1, 20, b}, {1, 10, c}, {1, 30, a}]|
#  ------------------------------------------------------------------------ 

Modifying values in order to equalize the order type can easily be done if you deal with integers. E.g. if we want the final sort to be f2 asc, f3 desc, we can add - sign before f3, so that we could use the same (e.g. ascending) order type for both fields.

df = df.withColumn("col1", F.expr("transform(col1, x -> struct(x.f2, -x.f3, x.f1))"))

df.show(truncate=0)
#  ------------------------------------------------------------------------------ 
# |col1                                                                          |
#  ------------------------------------------------------------------------------ 
# |[{2, -20, e}, {2, -10, f}, {2, -30, d}, {1, -20, b}, {1, -10, c}, {1, -30, a}]|
#  ------------------------------------------------------------------------------ 

Sorting f2 asc, f3 desc (f3 was modified, so that asc would work for both fields). The goal was to get abcdef:

df = df.withColumn("col1", F.sort_array("col1"))

df.show(truncate=0)
#  ------------------------------------------------------------------------------ 
# |col1                                                                          |
#  ------------------------------------------------------------------------------ 
# |[{1, -30, a}, {1, -20, b}, {1, -10, c}, {2, -30, d}, {2, -20, e}, {2, -10, f}]|
#  ------------------------------------------------------------------------------ 

Another example if you care not to change the values and/or order inside the struct. struct(x.f2, -x.f3) _sort inner struct is created just for ordering at the beginning and immediately after the sort it's removed.

df = df.withColumn("col1", F.expr("sort_array(transform(col1, x -> struct(struct(x.f2, -x.f3) _sort, x.f1, x.f2, x.f3)))"))
df = df.withColumn("col1", F.expr("transform(col1, x -> struct(x.f1, x.f2, x.f3))"))

df.show(truncate=0)
#  ------------------------------------------------------------------------ 
# |col1                                                                    |
#  ------------------------------------------------------------------------ 
# |[{a, 1, 30}, {b, 1, 20}, {c, 1, 10}, {d, 2, 30}, {e, 2, 20}, {f, 2, 10}]|
#  ------------------------------------------------------------------------ 

More elaborate way

Comparator function may be needed in more demanding cases. It is passed as the second parameter in array_sort function in SQL API. PySpark doesn't have the option for such parameter. In the function, l means left, r means right. It loops through elements in array and finds the position for them based on the specified case conditions.

To make the ordering f2 asc, f3 desc, we first describe conditions for f2, then for f3.

df = df.withColumn("col1", F.expr("""
    array_sort(
        col1,
        (l, r) -> case when l.f2 < r.f2 then -1
                       when l.f2 > r.f2 then 1
                       when l.f3 > r.f3 then -1
                       when l.f3 < r.f3 then 1
                       else 0
                  end)
    """
))

df.show(truncate=0)
#  ------------------------------------------------------------------------ 
# |col1                                                                    |
#  ------------------------------------------------------------------------ 
# |[{a, 1, 30}, {b, 1, 20}, {c, 1, 10}, {d, 2, 30}, {e, 2, 20}, {f, 2, 10}]|
#  ------------------------------------------------------------------------ 
  • Related