I have the following challenge: a dataframe called hashtags_users_grouped with the following structure:
hashtag_id | user_id | count
       123 |       1 |     1
       245 |       1 |     3
       123 |       2 |     5
Each row tells me that a certain user mentioned a certain hashtag and how many times they did so. In this example, user 1 mentioned hashtag 123 once and hashtag 245 three times, while user 2 only mentioned hashtag 123, five times.
I want to have a dataframe with the following output:
user | 123 | 245
   1 |   1 |   3
   2 |   5 |   0
In other words, the same information as the first table, but with a column per hashtag, so I know the number of times each user mentioned each hashtag. I read the documentation and tried to run the following, without success:
a = hashtags_users_grouped.groupBy("user_id").pivot("hashtag_id")
a.show(5)
I got the following error message:
AttributeError: 'GroupedData' object has no attribute 'show'
Do you know any way to do this?
CodePudding user response:
After applying pivot() you need to perform an aggregation; in this case the aggregation is first(), since the count metric has already been computed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data
df = spark.createDataFrame([(123, 1, 1),
                            (245, 1, 3),
                            (123, 2, 5)],
                           ("hashtag_id", "user_id", "count"))

# Pivot hashtag_id into columns, taking the precomputed count per (user, hashtag)
df.groupBy("user_id") \
  .pivot("hashtag_id") \
  .agg(F.first("count")) \
  .show()
Output
+-------+---+----+
|user_id|123| 245|
+-------+---+----+
|      1|  1|   3|
|      2|  5|null|
+-------+---+----+
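If you want zeros instead of nulls, as in your desired output, one option (a sketch using the same df as above) is to fill the missing values after the pivot with na.fill:

# Replace the nulls produced by the pivot with 0 to match the desired output
result = df.groupBy("user_id") \
    .pivot("hashtag_id") \
    .agg(F.first("count")) \
    .na.fill(0)

result.show()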