How to "select" columns from an array of struct?-CodePudding

I have an array of structs and I want to create a new column, but only select a few columns from the struct.

Currently, I am doing it this way:

F.array(F.struct(
    F.col('orig_column.item_1'),
    F.col('orig_column.item_2'),
    F.col('orig_column.item_3')))

However, the schema is not what I want. I get this:

root
 |-- orig_column: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- item_1: long (nullable = true)
 |    |    |-- item_2: long (nullable = true)
 |    |    |-- item_3: long (nullable = true)
 |    |    |-- item_4: integer (nullable = true)
 |    |    |-- item_5: integer (nullable = true)
 |    |    |-- item_6: integer (nullable = true)
 |    |    |-- item_7: boolean (nullable = true)
 |-- new_column: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- item_1: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |    |    |-- item_2: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |    |    |-- item_3: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |    |    |-- item_4: array (nullable = true)
 |    |    |    |-- element: integer (containsNull = true)

What I wanted to happen is:

root
 |-- new_column: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- item_1: long (nullable = true)
 |    |    |-- item_2: long (nullable = true)
 |    |    |-- item_3: long (nullable = true)

What am I doing wrong?

CodePudding user response：

Since the structs are inside an array, you can "loop" through every element of the array using transform to transform it to whatever you like. The following code selects only desired fields.

Spark 3.1

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [([(2, 2, 2, 2,)],)],
    'orig_column:array<struct<item_1:bigint,item_2:bigint,item_3:bigint,item_4:int>>'
)

df = df.withColumn(
    'new_column',
    F.transform(
        "orig_column", 
        lambda x: F.struct(
            x.item_1.alias('item_1'), 
            x.item_2.alias("item_2"), 
            x.item_3.alias("item_3"))
    )
)

df.printSchema()
# root
#  |-- orig_column: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- item_1: long (nullable = true)
#  |    |    |-- item_2: long (nullable = true)
#  |    |    |-- item_3: long (nullable = true)
#  |    |    |-- item_4: integer (nullable = true)
#  |-- new_column: array (nullable = true)
#  |    |-- element: struct (containsNull = false)
#  |    |    |-- item_1: long (nullable = true)
#  |    |    |-- item_2: long (nullable = true)
#  |    |    |-- item_3: long (nullable = true)

Spark 3.1.2

df = df.withColumn(
    'new_column',
    F.expr("transform(orig_column, x -> struct(x.item_1, x.item_2, x.item_3))")
)

Spark 2.4

df = df.withColumn(
    'new_column',
    F.expr("transform(orig_column, x -> struct(x.item_1 as item_1, x.item_2 as item_2, x.item_3 as item_3))")
)