Home > Enterprise >  Create new column based on how many rows, with condition based on another column, are within X days
Create new column based on how many rows, with condition based on another column, are within X days

Time:11-27

My DF currently has just the first two columns DATE and RESULT and I want to create the third column N_RESULTS_EQUAL_1_PAST_60_DAYS:

DATE                        RESULT   N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS
2018-12-26 23:13:43 00:00   1        0
2019-02-18 23:27:58 00:00   0        1
2019-02-28 15:02:33 00:00   0        1
2019-03-05 18:30:26 00:00   1        2
2019-05-21 14:54:52 00:00   1        0
2019-08-26 14:30:38 00:00   1        0
2019-09-19 15:51:01 00:00   1        1
2019-12-16 17:58:24 00:00   0        0
2021-02-23 03:50:33 00:00   0        0
2021-08-08 22:26:01 00:00   1        0
2021-09-01 18:04:46 00:00   0        1

For each row I want to check all the previous rows that are within 60 days of the current row and sum up how many RESULT == 1 these previous rows have. I can only think in a double for loop to solve this problem, which is not efficient. If there a more efficient way to solve this problem?

CodePudding user response:

Assuming that 'DATE' is a DatetimeIndex, you can simply use .rolling() which now works for ragged datetimes:

df['N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS'] = df.rolling('60D').sum().astype('int')

however, I understand you don't want to include the 'RESULT' itself, only the sum from previous ones. In that case, just subtract it:

df['N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS'] = df['N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS'] - df['RESULT']

Output:

                           RESULT  N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS
DATE                                                                 
2018-12-26 23:13:43 00:00       1                                   0
2019-02-18 23:27:58 00:00       0                                   1
2019-02-28 15:02:33 00:00       0                                   0
2019-03-05 18:30:26 00:00       1                                   0
2019-05-21 14:54:52 00:00       1                                   0
2019-08-26 14:30:38 00:00       1                                   0
2019-09-19 15:51:01 00:00       1                                   1
2019-12-16 17:58:24 00:00       0                                   0
2021-02-23 03:50:33 00:00       0                                   0
2021-08-08 22:26:01 00:00       1                                   0
2021-09-01 18:04:46 00:00       0                                   1

The numbers don't add up to what you showed in the example, but they seem correct to me, according to your specification of previous 60 days.

CodePudding user response:

It seems @MethodGuy already described how to use rolling() when I was working on solution but I put my version because I have something else.

And I also get the same result as @MethodGuy which are different then N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS so I checked how many days is between first and last date in rolling window`.

I'm not sure but maybe it should be 61D (minus last date in function) to get past 60D


If you would have DATE as index then you could use rolling('60D') to create rolling window and work with only last 60 days - and then you can use .sum(), .count(), etc. You can also use .apply(func) to run own function which can skip current date

def result(data):
    return data[:-1].sum()

df['result'] = df['RESULT'].rolling('60D').apply(result).astype(int)

Minimal working code which shows .sum(), .count(), .apply() to calculate sum without current day. I also use .apply() to calculate days bettween first and last date in rolling window

text = '''DATE                        RESULT   60_DAYS
2018-12-26 23:13:43 00:00   1        0
2019-02-18 23:27:58 00:00   0        1
2019-02-28 15:02:33 00:00   0        1
2019-03-05 18:30:26 00:00   1        2
2019-05-21 14:54:52 00:00   1        0
2019-08-26 14:30:38 00:00   1        0
2019-09-19 15:51:01 00:00   1        1
2019-12-16 17:58:24 00:00   0        0
2021-02-23 03:50:33 00:00   0        0
2021-08-08 22:26:01 00:00   1        0
2021-09-01 18:04:46 00:00   0        1'''

import pandas as pd

import io
df = pd.read_csv(io.StringIO(text), sep='\s{2,}')
df.index = pd.to_datetime(df['DATE'])
del df['DATE']

print(df)

def result1(data):
    data = data[:-1]
    return data.sum()

def result2(data):
    data = data[:-1]
    return len(data[ data == 1 ])

def days(data):
    return (data.index[-1] - data.index[0]).days

window = df['RESULT'].rolling('60D')

df['sum']     = window.sum().astype(int)
df['count']   = window.count().astype(int)
df['result1'] = window.apply(result1).astype(int)
df['result2'] = window.apply(result2).astype(int)
df['days']    = window.apply(days).astype(int)

print(df)

Result:

                           RESULT  60_DAYS  sum  count  result1  result2  days
DATE                                                                          
2018-12-26 23:13:43 00:00       1        0    1      1        0        0     0
2019-02-18 23:27:58 00:00       0        1    1      2        1        1    54
2019-02-28 15:02:33 00:00       0        1    0      2        0        0     9
2019-03-05 18:30:26 00:00       1        2    1      3        0        0    14
2019-05-21 14:54:52 00:00       1        0    1      1        0        0     0
2019-08-26 14:30:38 00:00       1        0    1      1        0        0     0
2019-09-19 15:51:01 00:00       1        1    2      2        1        1    24
2019-12-16 17:58:24 00:00       0        0    0      1        0        0     0
2021-02-23 03:50:33 00:00       0        0    0      1        0        0     0
2021-08-08 22:26:01 00:00       1        0    1      1        0        0     0
2021-09-01 18:04:46 00:00       0        1    1      2        1        1    23
  • Related