I have a dataframe of schema -
|-- A: string (nullable = true)
|-- B: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
| | |-- z: double (nullable = true)
|-- C: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
I want to merge columns B and C with array_union, but array_union fails because the two columns have different element types. The structs in B and C have essentially the same fields except z, and I don't care whether z is present in the merged output.
What would be a good way to achieve this?
Answer:
Transform column 'C' like this, then apply array_union:
import pyspark.sql.functions as f

df = (df
      # helper column: one placeholder z value per element of C
      .withColumn('z', f.expr('transform(C, element -> cast(1 AS double))'))
      # rebuild C's structs so they match B's fields (key, x, y, z)
      .withColumn('C', f.expr('transform(C, (element, idx) -> struct(element.key AS key, '
                              'element.x AS x, element.y AS y, element_at(z, idx + 1) AS z))'))
      .drop('z')
)
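Once both columns share the same element type, the union itself is a single call; a minimal sketch of that final step (the output column name 'D' is just an example):

df = df.withColumn('D', f.array_union(f.col('B'), f.col('C')))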
Answer:
Sure, drop z in B and then array_union():
from pyspark.sql.functions import expr, array_union, col

new = (df1
       .withColumn('B', expr("transform(B, s -> struct(s.key as key, s.x as x, s.y as y))"))  # drop z from B's structs
       .withColumn('D', array_union(col('B'), col('C')))                                      # merge B and C
       .drop('B', 'C')                                                                        # drop B and C if not needed
       )
new.printSchema()
root
|-- A: string (nullable = false)
|-- D: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
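If you would rather keep B's z values in the merged column, a variant (my own sketch, not part of the answers above) is to pad C's structs with a null z instead, so both element types become struct<key, x, y, z>:

from pyspark.sql.functions import expr, array_union, col

merged = (df
          # add a null z to C's structs so they match B's element type
          .withColumn('C', expr("transform(C, s -> struct(s.key as key, s.x as x, s.y as y, cast(null as double) as z))"))
          .withColumn('D', array_union(col('B'), col('C')))
          .drop('B', 'C'))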