update field in struct column pyspark - withField does not replace

Time:02-11

I have a table with a date column, which I want to use to update the value of a field in a struct that I define for a new column.

Simple table:

date 
----------
1960-12-01

Struct:

from pyspark.sql import types as T

value_type = T.StructType(
    [
        T.StructField("extra", T.MapType(T.StringType(), T.StringType(), True), True),
        T.StructField("date", T.StringType(), True),
        T.StructField("from_date", T.StringType(), True),
        T.StructField("to_date", T.StringType(), True),
        T.StructField("value", T.StringType(), True),
    ]
)

Which I use to create a new empty column that I then try to update with this code:

from pyspark.sql import functions as F

dateOfBirth_df = (
    df
    .withColumn("dob_new", F.lit(None).cast(value_type))
    .withColumn("dob_new", F.col("dob_new").withField("value", F.lit("date")))
)

This runs without error, but dob_new does not get updated. Does anyone know what I am doing wrong?

CodePudding user response:

When you do:

F.lit(None).cast(value_type)

it creates a column whose value is NULL as a whole. Calling withField on a null struct always returns null, so the update is silently lost.

You can try this instead: initialise every field to null with F.struct, and use F.col("date") (the column) rather than F.lit("date") (the literal string "date"):

from pyspark.sql import functions as F

dateOfBirth_df = df.withColumn(
    "dob_new",
    F.struct(*[F.lit(None).cast(f.dataType).alias(f.name) for f in value_type.fields])
).withColumn(
    "dob_new",
    F.col("dob_new").withField("value", F.col("date"))
)

dateOfBirth_df.show(truncate=False)
#+----------+------------------------------------+
#|date      |dob_new                             |
#+----------+------------------------------------+
#|1960-12-01|{null, null, null, null, 1960-12-01}|
#+----------+------------------------------------+