Starting from a DataFrame with a date
and user
column, I'd like to add a third count_past_5_days
column to indicate the rolling count of occurrences of each row's user during the past 5 days:
date | user | count_past_5_days |
---|---|---|
2020-01-01 | abc | 1 |
2020-01-01 | def | 1 |
2020-01-02 | abc | 2 |
2020-01-03 | abc | 3 |
2020-01-04 | abc | 4 |
2020-01-04 | def | 2 |
2020-01-04 | ghi | 1 |
2020-01-05 | abc | 5 |
2020-01-06 | abc | 5 |
2020-01-07 | abc | 5 |
I've tried the following:
df.set_index('date').rolling('5D')['user'].count()
But this gets the total count for the past five rolling days, not just for the specific user of the current row. How to get this rolling count, for each row's specific user only?
CodePudding user response:
Try this, you can chain rolling
on groupby
:
df.set_index('date').groupby('user')['user']\
.rolling('5D')\
.count()\
.rename('count_past_5_days')\
.reset_index()\
.sort_values('date')
Output:
user date count_past_5_days
0 abc 2020-01-01 1.0
1 def 2020-01-01 1.0
2 abc 2020-01-02 2.0
3 abc 2020-01-03 3.0
4 abc 2020-01-04 4.0
5 def 2020-01-04 2.0
6 ghi 2020-01-04 1.0
7 abc 2020-01-05 5.0
8 abc 2020-01-06 5.0
9 abc 2020-01-07 5.0
CodePudding user response:
You can perform a sum across 'dummy' column whose values are all 1. This is the same approach that pd.crosstab
uses under the hood- though we can name our output column directly.
out = (
df.assign(count_past_5_days=1)
.groupby('user')
.rolling('5D', on='date')['count_past_5_days']
.sum()
)
print(out)
user date
abc 2020-01-01 1.0
2020-01-02 2.0
2020-01-03 3.0
2020-01-04 4.0
2020-01-05 5.0
2020-01-06 5.0
2020-01-07 5.0
def 2020-01-01 1.0
2020-01-04 2.0
ghi 2020-01-04 1.0
Name: count_past_5_days, dtype: float64
This outputs a series whose values correspond to what you want. If you want your output to visually align with your input you can use any of the following...
out.sort_index(level='date').reset_index()
out.reset_index().sort_values('date')
out.reindex(pd.MultiIndex.from_frame(df).swaplevel()).reset_index()
Note that option 3 will preserve the original ordering of your data if it happens to not be sorted.
>>> out.sort_index(level='date').reset_index()
user date count_past_5_days
0 abc 2020-01-01 1.0
1 def 2020-01-01 1.0
2 abc 2020-01-02 2.0
3 abc 2020-01-03 3.0
4 abc 2020-01-04 4.0
5 def 2020-01-04 2.0
6 ghi 2020-01-04 1.0
7 abc 2020-01-05 5.0
8 abc 2020-01-06 5.0
9 abc 2020-01-07 5.0