I have the following challenge: a dataframe called hashtags_users_grouped with the following structure:
hashtag_id | user_id | count
       123 |       1 |     1
       245 |       1 |     3
       123 |       2 |     5
Each row tells me that a certain user mentioned a certain hashtag and how many times they did so. In this example, user 1 mentioned hashtag 123 once and hashtag 245 three times, while user 2 only mentioned hashtag 123, five times.
I want to have a dataframe with the following output:
user | 123 | 245
   1 |   1 |   3
   2 |   5 |   0
In other words, the same information as the first table, but with a column per hashtag, so I know the number of times each user mentioned each hashtag. I read the documentation and tried to run the following, without success:
a = hashtags_users_grouped.groupBy("user_id").pivot("hashtag_id")
a.show(5)
I got the following error message:
AttributeError: 'GroupedData' object has no attribute 'show'
Do you know any way to do this?
CodePudding user response:
After applying pivot() you need to perform an aggregation; in this case the aggregation is first(), since the count metric has already been computed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data
df = spark.createDataFrame([(123, 1, 1),
                            (245, 1, 3),
                            (123, 2, 5)],
                           ("hashtag_id", "user_id", "count"))

# Pivot hashtag_id into columns, taking the precomputed count per (user, hashtag)
df.groupBy("user_id") \
  .pivot("hashtag_id") \
  .agg(F.first("count")) \
  .show()
Output
+-------+---+----+
|user_id|123| 245|
+-------+---+----+
|      1|  1|   3|
|      2|  5|null|
+-------+---+----+
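If you want zeros instead of nulls, as in your desired output, one option (a sketch using the same df as above) is to fill the missing values after the pivot with na.fill:

# Replace the nulls produced by the pivot with 0 to match the desired output
result = df.groupBy("user_id") \
    .pivot("hashtag_id") \
    .agg(F.first("count")) \
    .na.fill(0)

result.show()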