How can I combine two array columns, 'languages1' and 'languages2', into a single column 'lang' that is an array of structs?
languages1 = ["Java1","Scala1","C 1"]
languages2 = ["Java2","Scala2","C 2"]
I need to create a new column 'lang' with the following data for the row above:
lang:
[ data:{
    language: Java1,
    languages2: Java2
  },
  data:{
    language: Scala1,
    languages2: Scala2
  },
  data:{
    language: C 1,
    languages2: C 2
  }
]
CodePudding user response:
arrays_zip should help here. From the PySpark documentation:
Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
from pyspark.sql import functions as F
df = spark.createDataFrame(
[(["Java1", "Scala1","C 1"], ["Java2", "Scala2","C 2"])],
['languages1', 'languages2'])
df = df.select(F.arrays_zip('languages1', 'languages2').alias('lang'))
df.show(truncate=0)
# +----------------------------------------------+
# |lang                                          |
# +----------------------------------------------+
# |[{Java1, Java2}, {Scala1, Scala2}, {C 1, C 2}]|
# +----------------------------------------------+
df.printSchema()
# root
#  |-- lang: array (nullable = true)
#  |    |-- element: struct (containsNull = false)
#  |    |    |-- languages1: string (nullable = true)
#  |    |    |-- languages2: string (nullable = true)
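To make the "N-th struct contains all N-th values" behaviour concrete, here is a minimal pure-Python sketch of what arrays_zip does for a single row. It does not use Spark at all: zip_arrays is a hypothetical helper written only for illustration, and the padding of shorter arrays with None is my assumption about how Spark fills missing elements with null.

```python
from itertools import zip_longest


def zip_arrays(field_names, *arrays):
    """Illustrative model of arrays_zip for one row (hypothetical helper, not a Spark API)."""
    if len(field_names) != len(arrays):
        raise ValueError("need exactly one field name per input array")
    # zip_longest pairs up the N-th elements of every array and pads shorter
    # arrays with None (assumed to correspond to Spark's null padding).
    return [dict(zip(field_names, values)) for values in zip_longest(*arrays)]


languages1 = ["Java1", "Scala1", "C 1"]
languages2 = ["Java2", "Scala2", "C 2"]

# Each dict plays the role of one struct; the keys play the role of field names.
lang = zip_arrays(["languages1", "languages2"], languages1, languages2)
print(lang)
```

As in the schema above, the field names here simply follow the input column names. Passing different names to the helper (for example "language" instead of "languages1") only changes the keys in this sketch; it says nothing about how to rename struct fields in Spark itself.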