Adding two struct fields in pyspark DataFrame


I have a DataFrame schema as follows (in PySpark):

root
 |-- ID: string (nullable = true)
 |-- 2022: struct (nullable = true)
 |    |-- val_1: int (nullable = true)
 |    |-- val_2: double (nullable = true)
 |    |-- val_3: double (nullable = true)
 |-- 2021: struct (nullable = true)
 |    |-- val_1: int (nullable = true)
 |    |-- val_2: double (nullable = true)
 |    |-- val_3: double (nullable = true)

I want to combine the 2021 and 2022 columns into a single column AGG that contains their field-wise sum for each row of the DataFrame.

I tried the following:

df.select(
    'ID',
    (F.col("2021") + F.col("2022")).alias("AGG")
).printSchema()

Where the desired output schema should be:

root
 |-- ID: string (nullable = true)
 |-- AGG: struct (nullable = true)
 |    |-- val_1: int (nullable = true)
 |    |-- val_2: double (nullable = true)
 |    |-- val_3: double (nullable = true)

But it seems that Spark only supports adding numeric types. Is there a way to sum the structs field by field without writing out each field explicitly?

CodePudding user response:

No, you can't add struct columns that way; the `+` operator only works on numeric types. You need to build a new struct column by summing the nested fields.

You can use a list comprehension over the list of inner fields to create a new column like this:

from pyspark.sql import functions as F


def add_struct(common_fields, s1, s2):
    # Sum each matching field of the two structs and rebuild a struct
    return F.struct(*[
        (s1[f] + s2[f]).alias(f) for f in common_fields
    ])


# get list of struct fields from schema
fields = df.select("2022.*").columns

df.select(
    'ID',
    add_struct(fields, F.col("2021"), F.col("2022")).alias("AGG")
)