Compute mean of a groupby in pandas

Time:11-05

I have a very large dataset of tweets. I want to compute the mean number of tweets per hour published by each user. I was able to group the tweets per hour per user, but how can I now compute the mean per hour?

I can't post all of the code because the dataset has been heavily preprocessed. The dataset has the columns user_id and created_at, where created_at is the timestamp of the published tweet, so I sorted by created_at and then grouped down to the hour:

grouped_df = tweets_df.sort_values(["created_at"]).groupby([
    tweets_df['user_id'],
    tweets_df['created_at'].dt.year, 
    tweets_df['created_at'].dt.month,
    tweets_df['created_at'].dt.day,
    tweets_df['created_at'].dt.hour])

I can count the tweets per hour per user using

tweet_per_hour = grouped_df["created_at"].count()

print(tweet_per_hour)

What I obtain using this code is

user_id     created_at  created_at  created_at  created_at
678033      2012        3           11          2             1
                                                14            1
                                                17            1
                                                18            1
                        4           13          4             1
                                                             ..
3164941860  2020        4           30          7             6
                                                9             2
                        5           1           1             2
                                                9             6
                                    2           6             1
Name: created_at, Length: 3829888, dtype: int64

where the last column is the number of tweets in that hour. For example, the row

678033      2012        3           11          2             1

indicates that user 678033 posted just 1 tweet on 2012-03-11 between 2:00 and 3:00.
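To make that reading of the index concrete, here is a minimal sketch that rebuilds the same kind of count on a tiny made-up frame and looks up a single hour. The three timestamps are invented for illustration and only the column names come from the question.

```python
import pandas as pd

# Toy stand-in for tweets_df; the timestamps are invented for illustration.
tweets_df = pd.DataFrame({
    "user_id": [678033, 678033, 678033],
    "created_at": pd.to_datetime([
        "2012-03-11 02:15", "2012-03-11 14:30", "2012-03-11 14:45",
    ]),
})

ts = tweets_df["created_at"]
# Same grouping keys as in the question: user, year, month, day, hour.
grouped_df = tweets_df.groupby([
    tweets_df["user_id"], ts.dt.year, ts.dt.month, ts.dt.day, ts.dt.hour,
])
tweet_per_hour = grouped_df["created_at"].count()

# Each index entry is the tuple (user_id, year, month, day, hour),
# so one hour of one user can be looked up with a full tuple.
n_at_2 = tweet_per_hour.loc[(678033, 2012, 3, 11, 2)]
n_at_14 = tweet_per_hour.loc[(678033, 2012, 3, 11, 14)]
print(n_at_2, n_at_14)
```

Hours in which the user did not tweet have no row at all, so they do not appear in the result as zeros.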

I need to take all the hourly tweet counts for a user and compute their mean for that user. So I want output like, for example:

user_id     average_tweets_per_hour
678033      4
665353      10

How can I do it?

CodePudding user response:

I'm not sure what the names of your columns are anymore, but it would be something like this:

tweet_per_hour.groupby(level="user_id").agg(avgTweetsPerHour="mean")

As was commented above, I can't test this without enough information to reproduce it, but the .agg() goes beautifully with .groupby()
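Since the original data is not available, here is a minimal end-to-end sketch on made-up data. It assumes only the two columns named in the question (user_id and created_at); the sample rows and the names hour, tweets_per_hour and average_tweets_per_hour are chosen for illustration. Flooring the timestamp to the hour is used here in place of the separate year/month/day/hour keys, which should give the same groups with a single, uniquely named index level.

```python
import pandas as pd

# Toy data standing in for tweets_df; the values are made up for illustration.
tweets_df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "created_at": pd.to_datetime([
        "2020-04-30 07:05", "2020-04-30 07:40", "2020-04-30 09:10",
        "2020-05-01 01:15", "2020-04-30 07:20", "2020-04-30 07:50",
    ]),
})

# Truncate each timestamp to the start of its hour, so one key replaces
# the separate year/month/day/hour keys.
hour = tweets_df["created_at"].dt.floor("h").rename("hour")

# Number of tweets per user for each hour in which that user tweeted.
tweets_per_hour = tweets_df.groupby(["user_id", hour]).size()

# Mean of those hourly counts for each user.
average_tweets_per_hour = (
    tweets_per_hour.groupby(level="user_id")
    .mean()
    .rename("average_tweets_per_hour")
)
print(average_tweets_per_hour)
```

Note that this averages only over hours in which the user posted at least one tweet, because hours without tweets never show up in the grouped counts. If silent hours should count as zero, the hourly counts would have to be reindexed over the full time range first.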

CodePudding user response:

You may be able to do something as simple as

df.groupby('A').mean()

But as the other responses have noted, it's difficult to know exactly what to suggest without something reproducible.
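As a rough illustration of that one-liner, assume the hourly counts have already been flattened into a plain DataFrame with one row per user and hour. The column name tweets_in_hour and the counts below are hypothetical, not taken from the real dataset.

```python
import pandas as pd

# Hypothetical hourly counts, already flattened into ordinary columns.
df = pd.DataFrame({
    "user_id": [678033, 678033, 678033, 665353, 665353],
    "tweets_in_hour": [1, 1, 4, 6, 2],
})

# Select the count column before calling mean(), so only that column is averaged.
result = df.groupby("user_id")["tweets_in_hour"].mean()
print(result)
```

Selecting the column explicitly avoids averaging any other numeric columns (such as year or day) that might be present in the frame.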
