|-- x: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- y: struct (nullable = true)
| | |-- z: struct (nullable = true)
| | | |-- aa: string (nullable = true)
I have the above nested schema where I want to change column z from struct to string.
|-- x: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- y: struct (nullable = true)
| | |-- z: string (nullable = true)
I'm not using Spark 3 but Spark 2.4.x. I'd prefer the Scala way, but Python works too, since this is a one-time manual thing to backfill some past data.
Is there a way to do this with a UDF or some other approach?
I know it's easy to do via to_json, but the nested array of structs is causing issues.
CodePudding user response:
For your specific case, you can do it with built-in functions on Spark 2.4 or Spark 3.0
Spark 2.4
You can use arrays_zip as follows:
- first, create one array per field you want as a struct element of your array
- second, use arrays_zip to zip those fields together
Here is the complete code, with df your input dataframe:
import org.apache.spark.sql.functions.{arrays_zip, col}

df.withColumn("x",
  arrays_zip(
    col("x").getField("y").alias("y"),
    col("x").getField("z").getField("aa").alias("z")
  ))
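Conceptually, getField on an array column pulls out one array per struct field, and arrays_zip stitches those parallel arrays back into a single array of structs. A pure-Python sketch of that semantics (illustrative only, not Spark code; the field values are made up):

```python
# Illustrative pure-Python model of Spark's arrays_zip: zip parallel
# field arrays back into an array of structs (dicts stand in for structs).
def arrays_zip(ys, zs):
    return [{"y": y, "z": z} for y, z in zip(ys, zs)]

# One sample value of column x, shaped like the question's schema (made-up data):
x = [{"y": {"k": 1}, "z": {"aa": "hello"}},
     {"y": {"k": 2}, "z": {"aa": "world"}}]

ys  = [e["y"] for e in x]          # col("x").getField("y")
aas = [e["z"]["aa"] for e in x]    # col("x").getField("z").getField("aa")

print(arrays_zip(ys, aas))
# [{'y': {'k': 1}, 'z': 'hello'}, {'y': {'k': 2}, 'z': 'world'}]
```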
Spark 3.0
You can use transform to rebuild the element struct of your array, as follows:
import org.apache.spark.sql.functions.{col, struct, transform}

df.withColumn("x", transform(
  col("x"),
  element => struct(
    element.getField("y").alias("y"),
    element.getField("z").getField("aa").alias("z")
  )
))
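Here transform maps a lambda over every element of the array, rebuilding each struct in place. A pure-Python sketch of that semantics (illustrative only, not Spark code; the sample data is made up):

```python
# Illustrative pure-Python model of Spark's transform higher-order function:
# apply a function to every element of an array column.
def transform(arr, f):
    return [f(e) for e in arr]

# One sample value of column x (made-up data):
x = [{"y": {"k": 1}, "z": {"aa": "hello"}}]

# Rebuild each element struct, replacing z (a struct) with z.aa (a string):
result = transform(x, lambda e: {"y": e["y"], "z": e["z"]["aa"]})
print(result)  # [{'y': {'k': 1}, 'z': 'hello'}]
```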
CodePudding user response:
You can cast inside a higher-order function. Note that printSchema() returns None, so assign the dataframe before printing:
df3 = df.withColumn('x', expr('transform(x, s -> struct(s.y, cast(s.z as string) as z))'))
df3.printSchema()
root
|-- x: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- y: struct (nullable = true)
| | |-- z: string (nullable = true)