Assign list of values to columns with binary value in PySpark


I have a PySpark dataframe with a column Categories that holds a list of values for a given ID. It looks like this; let's call it df1:

ID  | Categories
122   [16777277, 16777350, 16777483, 16777425, 16777303]

There are about 500 categories in total, but in the example above ID 122 only has the 5 shown. The full list of category names is available as df2.columns.

I want to transform each value in the list from the Categories column into its own column, like this (df3):

ID  | 16777277 | 16777350 | 16777483 | 16777425 | 44444 | 55555 | .......
122 |    1     |    1     |    1     |    1     |   0   |   0   |

Any idea how I can make this transformation? I tried using explode() followed by pivot(), but it seemed to map the list values to columns incorrectly.

What I tried:

import pyspark.sql.functions as F

df1 = df1.withColumn('exploded', F.explode('Categories'))
pivoted = df1.groupby('ID').pivot('exploded').count()

Thanks in advance!

CodePudding user response:

You're on the right track. Since Categories is a list, the combination of explode, groupBy, pivot, and lit(1) does the job:

import pyspark.sql.functions as F

df1 = spark.createDataFrame(
    [(122, [16777277, 16777350, 16777483, 16777425, 16777303])],
    ('ID', 'Categories'))

(df1.withColumn('exploded', F.explode('Categories'))
    .groupBy('ID')
    .pivot('exploded')
    .agg(F.lit(1))
    .na.fill(0)
    .show())

+---+--------+--------+--------+--------+--------+
| ID|16777277|16777303|16777350|16777425|16777483|
+---+--------+--------+--------+--------+--------+
|122|       1|       1|       1|       1|       1|
+---+--------+--------+--------+--------+--------+