I have a table with a date column, which I want to use to update the value of a field in a struct that I define for a new column.
Simple table:
date
----------
1960-12-01
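For anyone reproducing this, the table can be built like so (the SparkSession setup and imports are my addition, not from the original post):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F, types as T

spark = SparkSession.builder.getOrCreate()

# One-column table holding the date as a string
df = spark.createDataFrame([("1960-12-01",)], ["date"])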
Struct:
value_type = T.StructType(
    [
        T.StructField("extra", T.MapType(T.StringType(), T.StringType(), True), True),
        T.StructField("date", T.StringType(), True),
        T.StructField("from_date", T.StringType(), True),
        T.StructField("to_date", T.StringType(), True),
        T.StructField("value", T.StringType(), True),
    ]
)
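Casting a NULL literal to this type yields a nullable struct column; you can inspect the result with printSchema (a sketch, assuming the df built above):

df.withColumn("dob_new", F.lit(None).cast(value_type)).printSchema()
# root
#  |-- date: string (nullable = true)
#  |-- dob_new: struct (nullable = true)
#  |    |-- extra: map (nullable = true)
#  |    |    |-- key: string
#  |    |    |-- value: string (valueContainsNull = true)
#  |    |-- date: string (nullable = true)
#  |    |-- from_date: string (nullable = true)
#  |    |-- to_date: string (nullable = true)
#  |    |-- value: string (nullable = true)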
I use it to create a new empty column, which I then try to update with this code:
dateOfBirth_df = (
    df
    .withColumn("dob_new", F.lit(None).cast(value_type))
    .withColumn("dob_new", F.col("dob_new").withField("value", F.lit("date")))
)
This runs without errors, but the dob_new.value field does not get updated. Does anyone know what I am doing wrong?
CodePudding user response:
When you do:
F.lit(None).cast(value_type)
it creates a column whose value is NULL. Calling withField on a NULL struct always returns NULL, so the field assignment silently does nothing.
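You can verify this quickly (a sketch, assuming an active SparkSession named spark, the F import, and the value_type schema from the question):

# withField on a NULL struct still yields NULL: there is no struct to update
spark.range(1).select(
    F.lit(None).cast(value_type).withField("value", F.lit("x")).alias("s")
).show()
#+----+
#|   s|
#+----+
#|null|
#+----+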
You can try this instead:
from pyspark.sql import functions as F

dateOfBirth_df = df.withColumn(
    "dob_new",
    F.struct(*[F.lit(None).cast(f.dataType).alias(f.name) for f in value_type.fields])
).withColumn(
    "dob_new",
    F.col("dob_new").withField("value", F.col("date"))
)
dateOfBirth_df.show(truncate=False)
#+----------+------------------------------------+
#|date      |dob_new                             |
#+----------+------------------------------------+
#|1960-12-01|{null, null, null, null, 1960-12-01}|
#+----------+------------------------------------+
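As a side note, withField calls chain, so several fields can be filled in one pass; and if value must stay a string in the schema, add an explicit cast (a sketch under the same df and value_type assumptions, with dob_df a hypothetical name):

dob_df = df.withColumn(
    "dob_new",
    # Build a fully-typed struct of NULLs so withField has a non-NULL struct to update
    F.struct(*[F.lit(None).cast(f.dataType).alias(f.name) for f in value_type.fields])
).withColumn(
    "dob_new",
    F.col("dob_new")
    .withField("value", F.col("date").cast("string"))
    .withField("from_date", F.col("date").cast("string"))
)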