I have the below input table with array values in it. To get the expected output I've tried array_contains
, but wasn't able to get it.
The array index numbers should exactly match between the two columns
Input:
--------------- ------------------ --------------------
|ID |col_List |col_ID |
--------------- ------------------ --------------------
|10001 |["T1","T2] |["100","200"] |
--------------- ------------------ --------------------
|10002 |["T2"] |["300"] |
--------------- ------------------ --------------------
|10003 |["T1","T2","T3"] |["150","220","580" |
--------------- ------------------ --------------------
Expected output:
--------------- -------- -------- --------
|ID |T1 |T2 |T3 |
--------------- -------- -------- --------
|10001 |100 |200 |null |
--------------- -------- -------- --------
|10002 |null |300 |null |
--------------- -------- -------- --------
|10003 |150 |220 |580 |
--------------- -------- -------- --------
CodePudding user response:
We can use arrays_zip
to create a struct using the values of the arrays. So, first elements from both the arrays will be used to create a struct, then second elements will be used to create another struct, and so on. Explode the resulting array of structs to get individual rows for each values. We can then pivot the col_list
and use col_id
values.
data_sdf. \
withColumn('col_struct_arr', func.arrays_zip('col_list', 'col_id')). \
withColumn('col_struct', func.explode('col_struct_arr')). \
select('id', func.col('col_struct.*')). \
groupBy('id'). \
pivot('col_list'). \
agg(func.first('col_id')). \
show()
# ----- ---- --- ----
# | id| t1| t2| t3|
# ----- ---- --- ----
# |10002|null|300|null|
# |10001| 100|200|null|
# |10003| 150|220| 580|
# ----- ---- --- ----
CodePudding user response:
After arrays_zip
, you can do inline
explosion. Then, pivot
looks necessary.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[(10001, ["T1","T2"], ["100","200"]),
(10002, ["T2"], ["300"]),
(10003, ["T1","T2","T3"], ["150","220","580"])],
['ID', 'col_List', 'col_ID'])
Script:
df = df.selectExpr("ID", "inline(arrays_zip(col_List, col_ID))")
df = df.groupBy('ID').pivot('col_List').agg(F.first('col_ID'))
df.show()
# ----- ---- --- ----
# |ID |T1 |T2 |T3 |
# ----- ---- --- ----
# |10001|100 |200|null|
# |10002|null|300|null|
# |10003|150 |220|580 |
# ----- ---- --- ----