Combine array columns to create an array of struct

Time:09-21

How can I combine two array columns, 'languages1' and 'languages2', into a new column 'lang' that is an array of structs?

languages1 = ["Java1","Scala1","C++1"]
languages2 = ["Java2","Scala2","C++2"]

I need to create a new column 'lang' with the data below for the above row:

lang:
[ data:{
   language:Java1,
   languages2: Java2
   },
  data:{
   language:Scala1,
   languages2: Scala2
   },
  data:{
   language:C++1,
   languages2: C++2
   }
]

CodePudding user response:

arrays_zip should help here. Its documentation describes it as:

Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
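As a rough per-row picture of what that description means, here is a plain-Python sketch. It is not Spark code: zip_arrays is a made-up helper for illustration, and dicts stand in for structs. It uses zip_longest because arrays_zip fills in nulls when one array is shorter than the others, rather than truncating.

```python
from itertools import zip_longest


def zip_arrays(**columns):
    """Hypothetical helper: the N-th dict holds the N-th value of every input list."""
    names = list(columns)
    # zip_longest pads shorter lists with None, similar to the nulls arrays_zip produces
    rows = zip_longest(*columns.values(), fillvalue=None)
    return [dict(zip(names, values)) for values in rows]


lang = zip_arrays(
    languages1=["Java1", "Scala1", "C++1"],
    languages2=["Java2", "Scala2", "C++2"],
)
print(lang)
```

Each dict in lang corresponds to one struct in the resulting Spark array, with the input column names as field names.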

from pyspark.sql import SparkSession, functions as F

# In the pyspark shell or a notebook, `spark` already exists
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(["Java1", "Scala1", "C++1"], ["Java2", "Scala2", "C++2"])],
    ['languages1', 'languages2'])

# Pair up the N-th elements of both arrays into one struct per position
df = df.select(F.arrays_zip('languages1', 'languages2').alias('lang'))

df.show(truncate=False)
# +------------------------------------------------+
# |lang                                            |
# +------------------------------------------------+
# |[{Java1, Java2}, {Scala1, Scala2}, {C++1, C++2}]|
# +------------------------------------------------+

df.printSchema()
# root
#  |-- lang: array (nullable = true)
#  |    |-- element: struct (containsNull = false)
#  |    |    |-- languages1: string (nullable = true)
#  |    |    |-- languages2: string (nullable = true)