This is a follow-up question of this. I would like to perform a rolling window of the last n days but I want to filter out the latest x days from each window (x is smaller than n)
Here is an example:
d = {'Name': ['Jack', 'Jim', 'Jack', 'Jim', 'Jack', 'Jack', 'Jim', 'Jack', 'Jane', 'Jane'],
'Date': ['08/01/2021',
'27/01/2021',
'05/02/2021',
'10/02/2021',
'17/02/2021',
'18/02/2021',
'20/02/2021',
'21/02/2021',
'22/02/2021',
'29/03/2021'],
'Earning': [40, 10, 20, 20, 40, 50, 100, 70, 80, 90]}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df.Date, format='%d/%m/%Y')
df = df.sort_values('Date')
Name Date Earning
0 Jack 2021-01-08 40
1 Jim 2021-01-27 10
2 Jack 2021-02-05 20
3 Jim 2021-02-10 20
4 Jack 2021-02-17 40
5 Jack 2021-02-18 50
6 Jim 2021-02-20 100
7 Jack 2021-02-21 70
8 Jane 2021-02-22 80
9 Jane 2021-03-29 90
I would like to
- For each row, take the last
30 days
of the sameName
- call it a window - Remove the latest
20 days
of each window (i.e. only take the earliest 10 days) - Calculate the
sum
on theEarning
column
Expected outcome: (The two columns Window_From
and Window_To
are not needed. I only use them to demonstrate the mock data)
Name Date Earning Window_From Window_To Sum
0 Jack 2021-01-08 40 2020-12-09 2020-12-19 0.0
1 Jim 2021-01-27 10 2020-12-28 2021-01-07 0.0
2 Jack 2021-02-05 20 2021-01-06 2021-01-16 40.0
3 Jim 2021-02-10 20 2021-01-11 2021-01-21 0.0
4 Jack 2021-02-17 40 2021-01-18 2021-01-28 0.0
5 Jack 2021-02-18 50 2021-01-19 2021-01-29 0.0
6 Jim 2021-02-20 100 2021-01-21 2021-01-31 10.0
7 Jack 2021-02-21 70 2021-01-22 2021-02-01 0.0
8 Jane 2021-02-22 80 2021-01-23 2021-02-02 0.0
9 Jane 2021-03-29 90 2021-02-27 2021-03-09 0.0
CodePudding user response:
Easy solution
Calculate 30 days and 20 days rolling
sum
then subtract 30 day sum from 20 day sum to get the effective rolling
sum
for first 10 days
s1 = df.groupby('Name').rolling('30d', on='Date')['Earning'].sum()
s2 = df.groupby('Name').rolling('20d', on='Date')['Earning'].sum()
df.merge(s1.sub(s2).reset_index(name='sum'), how='left')
Name Date Earning sum
0 Jack 2021-01-08 40 0.0
1 Jim 2021-01-27 10 0.0
2 Jack 2021-02-05 20 40.0
3 Jim 2021-02-10 20 0.0
4 Jack 2021-02-17 40 0.0
5 Jack 2021-02-18 50 0.0
6 Jim 2021-02-20 100 10.0
7 Jack 2021-02-21 70 0.0
8 Jane 2021-02-22 80 0.0
9 Jane 2021-03-29 90 0.0
CodePudding user response:
An alternative to rolling (may be faster):
EDIT: actually slower with OP's dataset.
df['start'] = df['Date'] - pd.Timedelta(days=30)
df['end'] = df['start'] pd.Timedelta(days=10)
df = df.set_index(['Name', 'Date'])
df['Sum'] = [df.xs(n, level=0).loc[start:end, 'Earning'].sum()
for n, start, end in zip(df.index.get_level_values(0), df['start'], df['end'])]
print(df.reset_index().drop(columns=['start', 'end']))
Name Date Earning Sum
0 Jack 2021-01-08 40 0
1 Jim 2021-01-27 10 0
2 Jack 2021-02-05 20 40
3 Jim 2021-02-10 20 0
4 Jack 2021-02-17 40 0
5 Jack 2021-02-18 50 0
6 Jim 2021-02-20 100 10
7 Jack 2021-02-21 70 0
8 Jane 2021-02-22 80 0
9 Jane 2021-03-29 90 0