I have a DataFrame schema as follows (in PySpark):
root
|-- ID: string (nullable = true)
|-- 2022: struct (nullable = true)
| |-- val_1: int (nullable = true)
| |-- val_2: double (nullable = true)
| |-- val_3: double (nullable = true)
|-- 2021: struct (nullable = true)
| |-- val_1: int (nullable = true)
| |-- val_2: double (nullable = true)
| |-- val_3: double (nullable = true)
I want to add 2021
and 2022
columns into one column AGG
such that it will contain the sum of 2021
and 2022
for each row in the DataFrame.
I tried the following:
from pyspark.sql import functions as F

df.select(
    'ID',
    (F.col("2021") + F.col("2022")).alias("AGG")
).printSchema()
The desired output schema should be:
root
|-- ID: string (nullable = true)
|-- AGG: struct (nullable = true)
| |-- val_1: int (nullable = true)
| |-- val_2: double (nullable = true)
| |-- val_3: double (nullable = true)
But it seems that Spark only supports the + operator on numeric types. Is there a way to add the structs implicitly, without writing the sum out explicitly for each field in the struct?
Answer:
No, you can't add struct columns that way. You need to create a new struct column by summing the nested fields.
You can use a list comprehension over the list of inner fields to create a new column like this:
from pyspark.sql import functions as F

def add_struct(common_fields, s1, s2):
    # sum each pair of same-named nested fields and rebuild the struct
    return F.struct(*[
        (s1[f] + s2[f]).alias(f) for f in common_fields
    ])
# get list of struct fields from schema
fields = df.select("2022.*").columns
df.select(
    'ID',
    add_struct(fields, F.col("2021"), F.col("2022")).alias("AGG")
)
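For a quick sanity check, here is a minimal self-contained sketch (the toy row and its values are made up for illustration) that builds a DataFrame with the schema from the question and applies add_struct:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, DoubleType
)

spark = SparkSession.builder.getOrCreate()

# inner struct shared by the 2021 and 2022 columns
inner = StructType([
    StructField("val_1", IntegerType()),
    StructField("val_2", DoubleType()),
    StructField("val_3", DoubleType()),
])
schema = StructType([
    StructField("ID", StringType()),
    StructField("2021", inner),
    StructField("2022", inner),
])

# one made-up row: 2021 = (1, 2.0, 3.0), 2022 = (4, 5.0, 6.0)
df = spark.createDataFrame([("a", (1, 2.0, 3.0), (4, 5.0, 6.0))], schema)

fields = df.select("2022.*").columns
df.select(
    'ID',
    add_struct(fields, F.col("2021"), F.col("2022")).alias("AGG")
).show(truncate=False)
# AGG should hold (5, 7.0, 9.0), i.e. 1+4, 2.0+5.0, 3.0+6.0

Since each sum keeps its operands' type (int + int stays int, double + double stays double), the resulting AGG struct should have the same field types as the desired schema.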