PySpark: creating aggregated columns from the distinct values of a string column


I have this dataframe:

+---------+--------+------+
|    topic| emotion|counts|
+---------+--------+------+
|    dog  | sadness|   4  |
|    cat  |surprise|   1  |
|    bird |    fear|   3  |
|    cat  |     joy|   2  |
|    dog  |surprise|  10  |
|    dog  |surprise|   3  |
+---------+--------+------+

And I want to create a column for every distinct emotion, aggregating the counts per topic and emotion, so that the output looks like this:

+---------+--------+---------+-----+----------+
|    topic| fear   | sadness | joy | surprise |
+---------+--------+---------+-----+----------+
|    dog  |    0   |   4     |  0  |     13   |
|    cat  |    0   |   0     |  2  |     1    |
|    bird |    3   |   0     |  0  |     0    |
+---------+--------+---------+-----+----------+

This is what I have tried so far for the fear column, but the other emotions keep showing up for every topic. How can I get a result like the one above?

from pyspark.sql import functions as F

agg_emotion = df.groupby("topic", "emotion") \
    .agg(F.sum(F.when(F.col("emotion").eqNullSafe("fear"), 1)
              .otherwise(0)).alias('fear'))

CodePudding user response:

Group by topic and emotion and sum the counts, then group by topic and pivot on emotion:

from pyspark.sql import functions as F

(df.groupby('topic', 'emotion')
   .agg(F.sum('counts').alias('counts'))
   .groupby('topic')
   .pivot('emotion')
   .agg(F.first('counts'))
   .na.fill(0)
   .show())

+-----+----+---+-------+--------+
|topic|fear|joy|sadness|surprise|
+-----+----+---+-------+--------+
|  dog|   0|  0|      4|      13|
|  cat|   0|  2|      0|       1|
| bird|   3|  0|      0|       0|
+-----+----+---+-------+--------+
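For readers without a Spark session handy, here is a pure-Python sketch of the same sum-then-pivot logic, useful for sanity-checking the expected numbers (the `rows` list mirrors the question's dataframe; the dict-of-dicts stands in for the pivoted result):

```python
from collections import defaultdict

# Mirrors the (topic, emotion, counts) rows from the question.
rows = [
    ("dog", "sadness", 4),
    ("cat", "surprise", 1),
    ("bird", "fear", 3),
    ("cat", "joy", 2),
    ("dog", "surprise", 10),
    ("dog", "surprise", 3),
]

# Step 1: sum counts per (topic, emotion) pair -- the groupby/agg step.
totals = defaultdict(int)
for topic, emotion, counts in rows:
    totals[(topic, emotion)] += counts

# Step 2: spread emotions into columns per topic, filling missing
# combinations with 0 -- the pivot plus na.fill(0) step.
emotions = sorted({e for _, e, _ in rows})
pivoted = {
    topic: {e: totals.get((topic, e), 0) for e in emotions}
    for topic in {t for t, _, _ in rows}
}

print(pivoted["dog"])  # {'fear': 0, 'joy': 0, 'sadness': 4, 'surprise': 13}
```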