I have a PySpark dataframe (call it df1) with a column Categories that holds a list of values for a given ID. It looks like this:
ID  | Categories
122 | [16777277, 16777350, 16777483, 16777425, 16777303]
There are about 500 categories in total, but we can see in the example above that ID 122 only has the 5 categories shown. The full list of categories is available as df2.columns.
I want to transform each value in the Categories list into its own column, like this (df3):
ID  | 16777277 | 16777350 | 16777483 | 16777425 | 44444 | 55555 | ...
122 |        1 |        1 |        1 |        1 |     0 |     0 |
Any idea how I can make this transformation?
I tried using explode() followed by pivot(), but this seemed to map the list values to columns incorrectly.
What I tried:
from pyspark.sql import functions as F

df1 = df1.withColumn('exploded', F.explode('Categories'))
pivoted = df1.groupby('ID').pivot('exploded').count()
Thanks in advance!
CodePudding user response:
You're on the right track. Since Categories is a list, the pattern is: explode, groupBy, pivot, and aggregate with lit(1), then fill the missing cells with 0:
from pyspark.sql import functions as F

df1 = spark.createDataFrame(
    [(122, [16777277, 16777350, 16777483, 16777425, 16777303])],
    ('ID', 'Categories'))

(df1.withColumn('exploded', F.explode('Categories'))
    .groupBy('ID')
    .pivot('exploded')
    .agg(F.lit(1))
    .na.fill(0)
    .show())
+---+--------+--------+--------+--------+--------+
| ID|16777277|16777303|16777350|16777425|16777483|
+---+--------+--------+--------+--------+--------+
|122|       1|       1|       1|       1|       1|
+---+--------+--------+--------+--------+--------+