Ordering struct elements nested in an array

My schema has a struct nested inside an array. I want to order the fields of the nested struct alphabetically.

A related question offers a complex function for this, but it does not work for structs nested in arrays. Any help is appreciated.

I am working with PySpark 3.2.1.

My Schema:

root
 |-- id: integer (nullable = true)
 |-- values: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Dep: string (nullable = true)
 |    |    |-- ABC: string (nullable = true)

How it should look:

root
 |-- id: integer (nullable = true)
 |-- values: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ABC: string (nullable = true)
 |    |    |-- Dep: string (nullable = true)

Reproducible Example:

from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

# Assumes an active SparkSession named `spark` (as in the pyspark shell or a notebook).
data = [
    (10, [{"Dep": 10, "ABC": 1}, {"Dep": 10, "ABC": 1}]),
    (20, [{"Dep": 20, "ABC": 1}, {"Dep": 20, "ABC": 1}]),
    (30, [{"Dep": 30, "ABC": 1}, {"Dep": 30, "ABC": 1}]),
    (40, [{"Dep": 40, "ABC": 1}, {"Dep": 40, "ABC": 1}])
]
myschema = StructType([
    StructField("id", IntegerType(), True),
    StructField("values",
                ArrayType(
                    StructType([
                        StructField("Dep", StringType(), True),
                        StructField("ABC", StringType(), True)
                    ])
                ))
])
df = spark.createDataFrame(data=data, schema=myschema)
df.printSchema()
df.show(10, False)

CodePudding user response:

This does not cover all cases, but as a start for your current df: get the list of field names from the inner structs, sort them, then use the transform function to rebuild each struct element with the fields in sorted order, like this:

from pyspark.sql import functions as F

fields = sorted(df.selectExpr("inline(values)").columns)

df1 = df.withColumn(
    "values", 
    F.transform("values", lambda x: F.struct(*[x[f].alias(f) for f in fields]))
)

df1.printSchema()
#root
# |-- id: integer (nullable = true)
# |-- values: array (nullable = true)
# |    |-- element: struct (containsNull = false)
# |    |    |-- ABC: string (nullable = true)
# |    |    |-- Dep: string (nullable = true)
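
For deeper nesting (structs inside structs, arrays inside arrays), one possible generalization is to walk the schema and generate the transform/struct expression recursively. The following is only a rough sketch, not a tested drop-in solution: the helper name sorted_expr is made up for illustration, it expects the dict layout returned by df.schema.jsonValue(), it leaves map types untouched, and sorting is by plain case-sensitive string order. Also note that rebuilding with struct(...) is assumed to be acceptable here even though a null struct would come back as a struct of nulls.

```python
def sorted_expr(col, dtype, depth=0):
    """Build a Spark SQL expression that rebuilds `col` with the fields of
    every nested struct in alphabetical order.

    `dtype` is one node of the dict returned by df.schema.jsonValue():
    a plain string for atomic types, a dict for struct/array/map types.
    """
    if not isinstance(dtype, dict):
        return col  # atomic type such as "string" or "integer"

    if dtype["type"] == "struct":
        parts = []
        for field in sorted(dtype["fields"], key=lambda f: f["name"]):
            name = field["name"]
            inner = sorted_expr(f"{col}.`{name}`", field["type"], depth)
            parts.append(f"{inner} AS `{name}`")
        return "struct(" + ", ".join(parts) + ")"

    if dtype["type"] == "array":
        var = f"x{depth}"  # a distinct lambda variable per nesting level
        inner = sorted_expr(var, dtype["elementType"], depth + 1)
        if inner == var:
            return col  # nothing to reorder inside this array
        return f"transform({col}, {var} -> {inner})"

    return col  # maps and anything else are left as they are


# The "values" column type from the question, written out by hand in the
# same layout that df.schema.jsonValue() uses.
values_type = {
    "type": "array",
    "containsNull": True,
    "elementType": {
        "type": "struct",
        "fields": [
            {"name": "Dep", "type": "string", "nullable": True, "metadata": {}},
            {"name": "ABC", "type": "string", "nullable": True, "metadata": {}},
        ],
    },
}

expr = sorted_expr("values", values_type)
print(expr)
# transform(values, x0 -> struct(x0.`ABC` AS `ABC`, x0.`Dep` AS `Dep`))
```

With a live DataFrame the generated string could then be applied with something like df.withColumn("values", F.expr(expr)), taking the type node from df.schema.jsonValue() instead of writing it by hand.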

CodePudding user response:

I found an extremely hacky solution, so if anyone knows a better one, be my guest to add another answer.

1. Retrieve the array[struct] elements as their own array columns.
2. Zip them back together as a struct in the correct order.

Code:

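from pyspark.sql.functions import arrays_zip

# Pull each struct field out of the array into its own array column.
selexpr = ["id", "values.ABC as ABC", "values.Dep as Dep"]
df = df.selectExpr(selexpr)
df = df.withColumn(
    "zipped", arrays_zip("ABC", "Dep")  # the order of the column names determines the field order
)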