I have this dataframe:
--------- -------- ------
| topic| emotion|counts|
--------- -------- ------
| dog | sadness| 4 |
| cat |surprise| 1 |
| bird | fear| 3 |
| cat | joy| 2 |
| dog |surprise| 10 |
| dog |surprise| 3 |
--------- -------- ------
And I want to create a column for every different emotion aggregating the counts for every topic and every emotion, ending up having an output like this:
--------- -------- --------- ----- ----------
| topic| fear | sadness | joy | surprise |
--------- -------- --------- ----- ----------
| dog | 0 | 4 | 0 | 13 |
| cat | 0 | 0 | 2 | 1 |
| bird | 3 | 0 | 0 | 0 |
--------- -------- --------- ----- ----------
This is what I tried so far, for the fear column but the rest of the emotions keep showing up for every topic, how can I get a result like the above?
agg_emotion = df.groupby("topic", "emotion") \
.agg(F.sum(F.when(F.col("emotion").eqNullSafe("fear"), 1)\
.otherwise(0)).alias('fear'))
CodePudding user response:
groupy sum then groupby pivot the outcome
df.groupby('topic','emotion').agg(sum('counts').alias('counts')).groupby('topic').pivot('emotion').agg(F.first('counts')).na.fill(0).show()
----- ---- --- ------- --------
|topic|fear|joy|sadness|surprise|
----- ---- --- ------- --------
| dog| 0| 0| 4| 13|
| cat| 0| 2| 0| 1|
| bird| 3| 0| 0| 0|
----- ---- --- ------- --------