My DataFrame currently has just the first two columns, DATE and RESULT, and I want to create the third column, N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS:
DATE                       RESULT  N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS
2018-12-26 23:13:43+00:00       1                                   0
2019-02-18 23:27:58+00:00       0                                   1
2019-02-28 15:02:33+00:00       0                                   1
2019-03-05 18:30:26+00:00       1                                   2
2019-05-21 14:54:52+00:00       1                                   0
2019-08-26 14:30:38+00:00       1                                   0
2019-09-19 15:51:01+00:00       1                                   1
2019-12-16 17:58:24+00:00       0                                   0
2021-02-23 03:50:33+00:00       0                                   0
2021-08-08 22:26:01+00:00       1                                   0
2021-09-01 18:04:46+00:00       0                                   1
For each row I want to check all the previous rows that are within 60 days of the current row and count how many of those previous rows have RESULT == 1. I can only think of a double for loop to solve this problem, which is not efficient. Is there a more efficient way to solve it?
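For reference, a minimal sketch of the double loop I have in mind (using a hypothetical three-row subset of my data):

```python
import pandas as pd

df = pd.DataFrame({
    "DATE": pd.to_datetime([
        "2018-12-26 23:13:43+00:00",
        "2019-02-18 23:27:58+00:00",
        "2019-03-05 18:30:26+00:00",
    ]),
    "RESULT": [1, 0, 1],
})

# O(n^2) double loop: for each row, scan all earlier rows
counts = []
for i, row in df.iterrows():
    cutoff = row["DATE"] - pd.Timedelta(days=60)
    n = 0
    for j in range(i):
        if df.loc[j, "DATE"] > cutoff and df.loc[j, "RESULT"] == 1:
            n += 1
    counts.append(n)
df["N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS"] = counts
print(df["N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS"].tolist())
```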
CodePudding user response:
Assuming that 'DATE' is a DatetimeIndex, you can simply use .rolling(), which now works for ragged datetimes:
df['N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS'] = df['RESULT'].rolling('60D').sum().astype(int)
However, I understand you don't want to include the current row's 'RESULT' itself, only the sum from previous rows. In that case, just subtract it:
df['N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS'] = df['N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS'] - df['RESULT']
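Putting both steps together, a self-contained sketch (the four sample dates here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"RESULT": [1, 0, 1, 1]},
    index=pd.to_datetime(["2019-01-01", "2019-02-01", "2019-02-15", "2019-06-01"]),
)
df.index.name = "DATE"

# rolling 60-day sum over RESULT includes the current row...
rolled = df["RESULT"].rolling("60D").sum().astype(int)
# ...so subtract the current row's RESULT to count only previous rows
df["N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS"] = rolled - df["RESULT"]
print(df)
```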
Output:
                           RESULT  N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS
DATE
2018-12-26 23:13:43+00:00       1                                   0
2019-02-18 23:27:58+00:00       0                                   1
2019-02-28 15:02:33+00:00       0                                   0
2019-03-05 18:30:26+00:00       1                                   0
2019-05-21 14:54:52+00:00       1                                   0
2019-08-26 14:30:38+00:00       1                                   0
2019-09-19 15:51:01+00:00       1                                   1
2019-12-16 17:58:24+00:00       0                                   0
2021-02-23 03:50:33+00:00       0                                   0
2021-08-08 22:26:01+00:00       1                                   0
2021-09-01 18:04:46+00:00       0                                   1
The numbers don't match the ones in your example, but they look correct to me according to your specification of the previous 60 days.
CodePudding user response:
It seems @MethodGuy already described how to use rolling() while I was working on my solution, but I'm posting my version because I have something else to add. I also get the same results as @MethodGuy, which are different than your N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS, so I checked how many days lie between the first and last date in each rolling window. I'm not sure, but maybe it should be '61D' (minus the last date in the function) to get the past 60 days.
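A quick way to check which rows actually fall inside a '60D' window (the dates here are made up; the point is that a row more than 60 days back drops out):

```python
import pandas as pd

s = pd.Series(
    [1, 1, 1],
    index=pd.to_datetime(["2019-01-01", "2019-02-28", "2019-03-04"]),
)
# 2019-02-28 is 58 days after 2019-01-01 -> still inside its 60-day window
# 2019-03-04 is 62 days after 2019-01-01 -> 2019-01-01 falls out of its window
print(s.rolling("60D").sum().tolist())
```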
If you had DATE as the index, then you could use rolling('60D') to create a rolling window over only the last 60 days, and then use .sum(), .count(), etc. You can also use .apply(func) to run your own function, which can skip the current date:
def result(data):
    return data[:-1].sum()

df['result'] = df['RESULT'].rolling('60D').apply(result).astype(int)
Below is minimal working code which shows .sum(), .count(), and .apply() used to calculate the sum without the current day. I also use .apply() to calculate the days between the first and last date in each rolling window.
text = '''DATE                       RESULT  60_DAYS
2018-12-26 23:13:43+00:00  1       0
2019-02-18 23:27:58+00:00  0       1
2019-02-28 15:02:33+00:00  0       1
2019-03-05 18:30:26+00:00  1       2
2019-05-21 14:54:52+00:00  1       0
2019-08-26 14:30:38+00:00  1       0
2019-09-19 15:51:01+00:00  1       1
2019-12-16 17:58:24+00:00  0       0
2021-02-23 03:50:33+00:00  0       0
2021-08-08 22:26:01+00:00  1       0
2021-09-01 18:04:46+00:00  0       1'''

import pandas as pd
import io

# regex separator needs the python engine; columns are split on 2+ spaces
df = pd.read_csv(io.StringIO(text), sep=r'\s{2,}', engine='python')
df.index = pd.to_datetime(df['DATE'])
del df['DATE']
print(df)

def result1(data):
    # sum of RESULT over the window, excluding the current (last) row
    return data[:-1].sum()

def result2(data):
    # count of RESULT == 1 in the window, excluding the current (last) row
    data = data[:-1]
    return len(data[data == 1])

def days(data):
    # days between the first and last date in the window
    return (data.index[-1] - data.index[0]).days

window = df['RESULT'].rolling('60D')

df['sum']     = window.sum().astype(int)
df['count']   = window.count().astype(int)
df['result1'] = window.apply(result1).astype(int)
df['result2'] = window.apply(result2).astype(int)
df['days']    = window.apply(days).astype(int)

print(df)
Result:
                           RESULT  60_DAYS  sum  count  result1  result2  days
DATE
2018-12-26 23:13:43+00:00       1        0    1      1        0        0     0
2019-02-18 23:27:58+00:00       0        1    1      2        1        1    54
2019-02-28 15:02:33+00:00       0        1    0      2        0        0     9
2019-03-05 18:30:26+00:00       1        2    1      3        0        0    14
2019-05-21 14:54:52+00:00       1        0    1      1        0        0     0
2019-08-26 14:30:38+00:00       1        0    1      1        0        0     0
2019-09-19 15:51:01+00:00       1        1    2      2        1        1    24
2019-12-16 17:58:24+00:00       0        0    0      1        0        0     0
2021-02-23 03:50:33+00:00       0        0    0      1        0        0     0
2021-08-08 22:26:01+00:00       1        0    1      1        0        0     0
2021-09-01 18:04:46+00:00       0        1    1      2        1        1    23