Subtract one column by all values in another column only if condition is met-CodePudding

I have a dataframe that is a time series and I want to get the cumsum of differences between the CLOSE and SUBMISSION of an issue. However, I want it only to subtract if the CLOSE value is higher than the SUBMISSION value. Here are the data points (sorted by CLOSE), the expected output, and my attempted code:

df = pd.DataFrame({'REF_KEY': [1, 2, 3, 4, 5], 'SUBMISSION': ['2018-08-21', '2018-09-03', '2018-09-07', '2018-09-06', '2018-08-28'], 'CLOSE': ['2018-09-05', '2018-09-12', '2018-09-18', '2018-09-24', '2018-09-27']})
df['CLOSE'] = df['CLOSE'].astype('datetime64[ns]')
df['SUBMISSION'] = df['SUBMISSION'].astype('datetime64[ns]')

For REF_KEY == 1, ACCUM_DATE_DELTA should be the sum of:

15 day difference of ('2018-09-05' - '2018-08-21')
2 day difference between ('2018-09-05' - '09-03-2018')
8 day difference between ('2018-09-05' - '2018-08-28') making it 26

For REF_KEY == 2, you will get the sum of:

22 day difference between ('2018-09-12' - '2018-08-21')
9 day difference between ('2018-09-12' - '2018-09-03')
5 day difference between ('2018-09-12' - '2018-09-07')
6 day difference between ('2018-09-12' - '2018-09-06')
15 day difference between ('2018-09-12' - '2018-08-28')

So for REF_KEY == 1, you can see that the difference between its close date includes REF_KEY == [3, 4], and that is because the CLOSE is greater than SUBMISSION. Therefore, I had the idea of creating a condition where the CLOSE date has to be more than SUBMISSION date.

df_2 = pd.DataFrame({'REF_KEY': [1, 2, 3, 4, 5], 
                     'SUBMISSION': ['2018-08-21', '2018-09-03', '2018-09-07', '2018-09-06', '2018-08-28'], 'CLOSE': ['2018-09-05', '2018-09-12', '2018-09-18', '2018-09-24', '2018-09-27'], 'ACCUM_DATE_DELTA': [25, 57, 86, 116, 131]})
df_2['CLOSE'] = df['CLOSE'].astype('datetime64[ns]')
df_2['SUBMISSION'] = df['SUBMISSION'].astype('datetime64[ns]')

Attempted code:

df_2['ACCUM_DATE_DELTA'] = df_2['CLOSE']*len(df_2[df_2['CLOSE'] - df_2['SUBMISSION]]['SUBMISSION'].cumsum()) - df_2[df_2['CLOSE'] - df_2['SUBMISSION]]['SUBMISSION'].cumsum()

CodePudding user response：

I'm not whose calculations are correct. If mine are, let me know if this is what you want.

For 'REF_KEY' == 1, you can use this to find the cumulative sum of days

(df['CLOSE'][0] - df['SUBMISSION']).dt.days.clip(lower=0).sum()

df['CLOSE'][0] is first close date to calculate against all submission dates; dt.days gives days in integer

(df['CLOSE'][0] - df['SUBMISSION']).dt.days

0    15
1     2
2    -2
3    -1
4     8
Name: SUBMISSION, dtype: int64

Use clip(lower=0).sum() to change negative values to zero and sum

(df['CLOSE'][0] - df['SUBMISSION']).dt.days.clip(lower=0).sum()
result = 25

To automate this, use apply() with a custom function

def calc(x):
    # print((x - df['SUBMISSION']).dt.days.clip(lower=0).sum())
    return (x - df['SUBMISSION']).dt.days.clip(lower=0).sum()

df

   REF_KEY SUBMISSION      CLOSE
0        1 2018-08-21 2018-09-05
1        2 2018-09-03 2018-09-12
2        3 2018-09-07 2018-09-18
3        4 2018-09-06 2018-09-24
4        5 2018-08-28 2018-09-27

df['ACCUM_DATE_DELTA'] = df.apply(lambda x: calc(x['CLOSE']), axis=1)

   REF_KEY SUBMISSION      CLOSE  ACCUM_DATE_DELTA
0        1 2018-08-21 2018-09-05                25
1        2 2018-09-03 2018-09-12                57
2        3 2018-09-07 2018-09-18                87
3        4 2018-09-06 2018-09-24               117
4        5 2018-08-28 2018-09-27               132

CodePudding user response：

Cross-merge to generate their cartesian product of SUBMISSION x CLOSE
Keep only the rows where CLOSE > SUBMISSION
groupby the CLOSE dates and sum the group's CLOSE - SUBMISSION days
merge the ACCUM values back to the original df

m = pd.merge(df.SUBMISSION, df.CLOSE, how='cross') # cross-merge for all SUBMISSION x CLOSE combos

accum = (m.where(m.CLOSE > m.SUBMISSION)           # limit to CLOSE > SUBMISSION
          .groupby('CLOSE').SUBMISSION             # group by CLOSE
          .apply(lambda g: (g.name - g).sum())     # sum of all (CLOSE - SUBMISSION)
          .rename('ACCUM'))

df.merge(accum, on='CLOSE')                        # merge back to df

Output:

   REF_KEY  SUBMISSION       CLOSE     ACCUM
0        1  2018-08-21  2018-09-05   25 days
1        2  2018-09-03  2018-09-12   57 days
2        3  2018-09-07  2018-09-18   87 days
3        4  2018-09-06  2018-09-24  117 days
4        5  2018-08-28  2018-09-27  132 days

Notes:

how='cross' requires pandas 1.2.0 , so for earlier versions, merge on a dummy key column:

m = df[['SUBMISSION']].assign(key=0).merge(df[['CLOSE']].assign(key=0), on='key').drop(columns='key')

As with Jonathan's solution, a couple of these days are off by 1 compared to your output.