I have a DataFrame schema as follows (in PySpark):
root
|-- ID: string (nullable = true)
|-- 2022: struct (nullable = true)
| |-- val_1: int (nullable = true)
| |-- val_2: double (nullable = true)
| |-- val_3: double (nullable = true)
|-- 2021: struct (nullable = true)
| |-- val_1: int (nullable = true)
| |-- val_2: double (nullable = true)
| |-- val_3: double (nullable = true)
I want to add 2021
and 2022
columns into one column AGG
such that it will contain the sum of 2021
and 2022
for each row in the DataFrame.
I tried the following:
from pyspark.sql import functions as F

df.select(
    'ID',
    (F.col("2021") + F.col("2022")).alias("AGG")
).printSchema()
The desired output schema should be:
root
|-- ID: string (nullable = true)
|-- AGG: struct (nullable = true)
| |-- val_1: int (nullable = true)
| |-- val_2: double (nullable = true)
| |-- val_3: double (nullable = true)
But it seems that Spark only supports the + operator on numeric types. Is there a way to add the structs implicitly, without writing the sum out explicitly for each field in the struct?
Answer:
No, you can't add struct columns that way. You need to create a new struct column by summing the nested fields.
You can use a list comprehension over the list of inner fields to create a new column like this:
from pyspark.sql import functions as F

def add_struct(common_fields, s1, s2):
    # sum each pair of same-named nested fields and rebuild the struct
    return F.struct(*[
        (s1[f] + s2[f]).alias(f) for f in common_fields
    ])
# get list of struct fields from schema
fields = df.select("2022.*").columns
df.select(
    'ID',
    add_struct(fields, F.col("2021"), F.col("2022")).alias("AGG")
)
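For a quick sanity check, here is a minimal self-contained sketch (the toy row and its values are made up for illustration) that builds a DataFrame with the schema from the question and applies add_struct:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, DoubleType
)

spark = SparkSession.builder.getOrCreate()

# inner struct shared by the 2021 and 2022 columns
inner = StructType([
    StructField("val_1", IntegerType()),
    StructField("val_2", DoubleType()),
    StructField("val_3", DoubleType()),
])
schema = StructType([
    StructField("ID", StringType()),
    StructField("2021", inner),
    StructField("2022", inner),
])

# one made-up row: 2021 = (1, 2.0, 3.0), 2022 = (4, 5.0, 6.0)
df = spark.createDataFrame([("a", (1, 2.0, 3.0), (4, 5.0, 6.0))], schema)

fields = df.select("2022.*").columns
df.select(
    'ID',
    add_struct(fields, F.col("2021"), F.col("2022")).alias("AGG")
).show(truncate=False)
# AGG should hold (5, 7.0, 9.0), i.e. 1+4, 2.0+5.0, 3.0+6.0

Since each sum keeps its operands' type (int + int stays int, double + double stays double), the resulting AGG struct should have the same field types as the desired schema.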