Home > Enterprise >  How to pivot a Pyspark Dataframe
How to pivot a Pyspark Dataframe

Time:11-28

I have the following challenge: I have a dataframe called hashtags_users_grouped which has the following structure:

hashtag_id  |  user_id  |  count
   123            1          1
   245            1          3
   123            2          5

In each row, we find values that tell me when a certain user mentioned a certain hashtag and how many times he did it. In this example, user 1 mentioned hashtag 123 one time and 245 three times, while user 2 only mentioned hashtag 123 five times.

I want to have a dataframe with the following output:

user  |     123  |          245
1            1               3
2            5               0

In other words, the same information as the first table, but with a column per hashtag, to know the amount of times a user mentioned each hashtag. I read the documentation and tried to run the following, without success:

a = hashtags_users_joined_grouped_df.groupBy("user_id").pivot("hashtag_id")
a.show(5)

I got the following error message:

AttributeError: 'GroupedData' object has no attribute 'show'

Do you know any way to do this?

CodePudding user response:

After applying pivot you need to perform an aggregate, in this case the aggregate is first as the count metric has already been computed.

from pyspark.sql import functions as F

df = spark.createDataFrame([(123, 1, 1, ), 
                            (245, 1, 3), 
                            (123, 2, 5),], 
                           ("hashtag_id", "user_id", "count", ))

df.groupBy("user_id")\
  .pivot("hashtag_id")\
  .agg(F.first("count"))\
  .show()

Output

 ------- --- ---- 
|user_id|123| 245|
 ------- --- ---- 
|      1|  1|   3|
|      2|  5|null|
 ------- --- ---- 
  • Related