Spark - Merge two columns of array struct type


I have a dataframe with the following schema:

|-- A: string (nullable = true)
|-- B: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- key: string (nullable = true)
|    |    |-- x: double (nullable = true)
|    |    |-- y: double (nullable = true)
|    |    |-- z: double (nullable = true)
|-- C: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- key: string (nullable = true)
|    |    |-- x: double (nullable = true)
|    |    |-- y: double (nullable = true)

I want to merge columns B and C with array_union, but array_union fails because the two columns have different element types. The structs in B and C have the same fields except for z, and I don't care whether z is present in the merged output.
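A minimal reproduction (the values are made up):

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", [("k1", 1.0, 2.0, 3.0)], [("k1", 1.0, 2.0)])],
    "A string, "
    "B array<struct<key:string, x:double, y:double, z:double>>, "
    "C array<struct<key:string, x:double, y:double>>")

# fails with a data type mismatch error, since the element types differ:
# df.withColumn('D', f.array_union('B', 'C'))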

What would be a good way to achieve this?

CodePudding user response:

Transform column C first, so each of its structs gains a z field and matches B's element type, then apply array_union:

import pyspark.sql.functions as f

df = (df
      # rebuild each struct in C with a placeholder z so that C's element
      # type matches B's (key, x, y, z); the value of z doesn't matter here
      .withColumn('C', f.expr(
          "transform(C, element -> struct(element.key AS key, element.x AS x, "
          "element.y AS y, CAST(1 AS double) AS z))"))
     )
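With the element types aligned, the union itself is straightforward (D is just an arbitrary name for the merged column):

df = df.withColumn('D', f.array_union(f.col('B'), f.col('C')))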

CodePudding user response:

Sure, drop z from B and then array_union():

from pyspark.sql.functions import expr, array_union, col

new = (df1.withColumn('B', expr("transform(B, s -> struct(s.key as key, s.x as x, s.y as y))"))  # drop z
       .withColumn('D', array_union(col('B'), col('C')))  # merge with array_union
       .drop('B', 'C')  # drop B and C if no longer needed
      )
new.printSchema()

root
 |-- A: string (nullable = false)
 |-- D: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- x: double (nullable = true)
 |    |    |-- y: double (nullable = true)
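
If you are on Spark 3.1 or later (an assumption about your version), the same struct surgery can be written without rebuilding the struct by hand, using transform with Column.dropFields - a sketch:

from pyspark.sql.functions import transform, array_union, col

new = (df1
       # strip z from each struct in B, then union with C
       .withColumn('B', transform(col('B'), lambda s: s.dropFields('z')))
       .withColumn('D', array_union(col('B'), col('C')))
       .drop('B', 'C'))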