Convert Array values into Column name for another Array value in PySpark


I have the input table below, with array values in two of its columns. I tried array_contains to get the expected output, but couldn't make it work.

The array indices match exactly between the two columns: the n-th element of col_List is the label for the n-th element of col_ID.

Input:
+---------------+------------------+--------------------+
|ID             |col_List          |col_ID              |
+---------------+------------------+--------------------+
|10001          |["T1","T2"]       |["100","200"]       |
+---------------+------------------+--------------------+
|10002          |["T2"]            |["300"]             |
+---------------+------------------+--------------------+
|10003          |["T1","T2","T3"]  |["150","220","580"] |
+---------------+------------------+--------------------+

Expected output:
+---------------+--------+--------+--------+
|ID             |T1      |T2      |T3      |
+---------------+--------+--------+--------+
|10001          |100     |200     |null    |
+---------------+--------+--------+--------+
|10002          |null    |300     |null    |
+---------------+--------+--------+--------+
|10003          |150     |220     |580     |
+---------------+--------+--------+--------+

CodePudding user response:

We can use arrays_zip to pair up the values of the two arrays: the first elements of both arrays form one struct, the second elements form the next, and so on. Explode the resulting array of structs to get an individual row for each pair. We can then pivot on col_list and fill the new columns with the col_id values.

from pyspark.sql import functions as func

# data_sdf is the input dataframe from the question
data_sdf. \
    withColumn('col_struct_arr', func.arrays_zip('col_list', 'col_id')). \
    withColumn('col_struct', func.explode('col_struct_arr')). \
    select('id', func.col('col_struct.*')). \
    groupBy('id'). \
    pivot('col_list'). \
    agg(func.first('col_id')). \
    show()

# +-----+----+---+----+
# |   id|  t1| t2|  t3|
# +-----+----+---+----+
# |10002|null|300|null|
# |10001| 100|200|null|
# |10003| 150|220| 580|
# +-----+----+---+----+
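
For clarity, this is the intermediate shape right before the groupBy-pivot (a minimal sketch, reusing data_sdf and func from above):

data_sdf. \
    withColumn('col_struct_arr', func.arrays_zip('col_list', 'col_id')). \
    withColumn('col_struct', func.explode('col_struct_arr')). \
    select('id', func.col('col_struct.*')). \
    show()

# one row per zipped pair, with three columns: id, col_list (the
# future column name) and col_id (the value that lands under it)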

CodePudding user response:

After arrays_zip, you can use inline to explode the zipped structs directly into columns. Then a pivot is needed to turn the col_List values into column names.

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(10001, ["T1","T2"], ["100","200"]),
     (10002, ["T2"], ["300"]),
     (10003, ["T1","T2","T3"], ["150","220","580"])],
    ['ID', 'col_List', 'col_ID'])

Script:

df = df.selectExpr("ID", "inline(arrays_zip(col_List, col_ID))")
df = df.groupBy('ID').pivot('col_List').agg(F.first('col_ID'))

df.show()
# +-----+----+---+----+
# |ID   |T1  |T2 |T3  |
# +-----+----+---+----+
# |10001|100 |200|null|
# |10002|null|300|null|
# |10003|150 |220|580 |
# +-----+----+---+----+
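
A side note on pivot that applies to both answers (an optional tweak, not part of the original scripts): when the pivot keys are known up front, passing them explicitly spares Spark an extra job to collect the distinct col_List values first.

# same script, with the pivot values listed explicitly
# (assumes the keys are known to be T1, T2 and T3)
df = df.selectExpr("ID", "inline(arrays_zip(col_List, col_ID))")
df = df.groupBy('ID').pivot('col_List', ['T1', 'T2', 'T3']).agg(F.first('col_ID'))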