I have a very large dataset of tweets from Twitter. I want to compute the mean number of tweets per hour published by each user. I was able to group the tweets per hour per user, but how can I now compute the mean per hour?
I can't post all the code, since the dataset has been heavily preprocessed. The dataset has a column user_id and a column created_at, which is the timestamp of the published tweet, so I sorted by created_at and then grouped down to the hour:
grouped_df = tweets_df.sort_values(["created_at"]).groupby([
tweets_df['user_id'],
tweets_df['created_at'].dt.year,
tweets_df['created_at'].dt.month,
tweets_df['created_at'].dt.day,
tweets_df['created_at'].dt.hour])
I can count the tweets per hour per user using
tweet_per_hour = grouped_df["created_at"].count()
print(tweet_per_hour)
What I obtain from this code is
user_id created_at created_at created_at created_at
678033 2012 3 11 2 1
14 1
17 1
18 1
4 13 4 1
..
3164941860 2020 4 30 7 6
9 2
5 1 1 2
9 6
2 6 1
Name: created_at, Length: 3829888, dtype: int64
where the last column is the count of tweets per hour. For example,
678033 2012 3 11 2 1
indicates that user 678033 made just 1 tweet on 2012-03-11 between 2 o'clock and 3 o'clock.
I need to sum all the tweets per hour made by each user and compute a mean for that user. So I want output like this, for example:
user_id average_tweets_per_hour
678033 4
665353 10
How can I do it?
CodePudding user response:
I'm not sure what the names of your columns are anymore, but it would be something like this:
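# tweet_per_hour is the counted Series from the question; grouped_df itself is a GroupBy object and has no reset_index()
tweet_per_hour.groupby(level="user_id").agg(avgTweetsPerHour="mean")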
As was commented above, I can't test this without enough information to reproduce it, but .agg() goes beautifully with .groupby().
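To make the idea concrete, here is a minimal sketch on made-up data. The toy tweets_df and its values are assumptions standing in for the real dataset; the grouping is copied from the question, and the hourly counts are then averaged per user via the user_id index level.

```python
import pandas as pd

# Hypothetical toy data standing in for the question's tweets_df.
tweets_df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "created_at": pd.to_datetime([
        "2020-04-30 07:05", "2020-04-30 07:40", "2020-04-30 07:59",
        "2020-04-30 09:10",
        "2020-05-01 01:00", "2020-05-01 03:00",
    ]),
})

# Same grouping as in the question: one group per user per calendar hour.
ts = tweets_df["created_at"]
tweet_per_hour = tweets_df.groupby(
    [tweets_df["user_id"], ts.dt.year, ts.dt.month, ts.dt.day, ts.dt.hour]
)["created_at"].count()

# Average the hourly counts within each user, using the index level name.
avg_per_user = tweet_per_hour.groupby(level="user_id").agg(avgTweetsPerHour="mean")
print(avg_per_user)
```

Here user 1 has hourly counts of 3 and 1, and user 2 has counts of 1 and 1, so the averages are taken over those hours only.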
CodePudding user response:
You may be able to do something as simple as
df.groupby('A').mean()
But as the other responses have noted, it's difficult to know exactly what to suggest without something reproducible.
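Applied to the question's columns, that pattern might look like the sketch below. The toy data and the helper column names hour and tweets are assumptions for illustration; the output column name average_tweets_per_hour comes from the desired output in the question. Instead of grouping on year, month, day and hour separately, each timestamp is truncated to the start of its hour with dt.floor, which gives a single hourly key.

```python
import pandas as pd

# Hypothetical toy data standing in for the question's tweets_df.
tweets_df = pd.DataFrame({
    "user_id": [10, 10, 10, 10, 10, 20],
    "created_at": pd.to_datetime([
        "2012-03-11 02:15", "2012-03-11 02:45",
        "2012-03-11 14:00", "2012-03-11 14:01",
        "2012-03-12 14:30",
        "2012-03-11 18:20",
    ]),
})

# Count tweets per user per hour, keyed by the hour-truncated timestamp.
hourly = (
    tweets_df.assign(hour=tweets_df["created_at"].dt.floor("h"))
    .groupby(["user_id", "hour"])
    .size()
    .reset_index(name="tweets")
)

# The simple groupby-then-mean step on the hourly counts.
result = (
    hourly.groupby("user_id")["tweets"]
    .mean()
    .rename("average_tweets_per_hour")
    .reset_index()
)
print(result)
```

Note that this averages only over hours in which the user tweeted at least once; hours with no tweets are not in the grouped data, so they do not pull the mean down. If idle hours should count as zero, the hourly counts would need to be reindexed over the full time range first.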