I have a dataframe that is a time series and I want to get the cumsum of differences between the CLOSE
and SUBMISSION
of an issue. However, I want it only to subtract if the CLOSE
value is higher than the SUBMISSION
value. Here are the data points (sorted by CLOSE
), the expected output, and my attempted code:
df = pd.DataFrame({'REF_KEY': [1, 2, 3, 4, 5], 'SUBMISSION': ['2018-08-21', '2018-09-03', '2018-09-07', '2018-09-06', '2018-08-28'], 'CLOSE': ['2018-09-05', '2018-09-12', '2018-09-18', '2018-09-24', '2018-09-27']})
df['CLOSE'] = df['CLOSE'].astype('datetime64[ns]')
df['SUBMISSION'] = df['SUBMISSION'].astype('datetime64[ns]')
For REF_KEY == 1
, ACCUM_DATE_DELTA
should be the sum of:
- 15 day difference of ('2018-09-05' - '2018-08-21')
- 2 day difference between ('2018-09-05' - '09-03-2018')
- 8 day difference between ('2018-09-05' - '2018-08-28') making it 26
For REF_KEY == 2
, you will get the sum of:
- 22 day difference between ('2018-09-12' - '2018-08-21')
- 9 day difference between ('2018-09-12' - '2018-09-03')
- 5 day difference between ('2018-09-12' - '2018-09-07')
- 6 day difference between ('2018-09-12' - '2018-09-06')
- 15 day difference between ('2018-09-12' - '2018-08-28')
So for REF_KEY == 1
, you can see that the difference between its close date includes REF_KEY == [3, 4]
, and that is because the CLOSE
is greater than SUBMISSION
. Therefore, I had the idea of creating a condition where the CLOSE
date has to be more than SUBMISSION
date.
df_2 = pd.DataFrame({'REF_KEY': [1, 2, 3, 4, 5],
'SUBMISSION': ['2018-08-21', '2018-09-03', '2018-09-07', '2018-09-06', '2018-08-28'], 'CLOSE': ['2018-09-05', '2018-09-12', '2018-09-18', '2018-09-24', '2018-09-27'], 'ACCUM_DATE_DELTA': [25, 57, 86, 116, 131]})
df_2['CLOSE'] = df['CLOSE'].astype('datetime64[ns]')
df_2['SUBMISSION'] = df['SUBMISSION'].astype('datetime64[ns]')
Attempted code:
df_2['ACCUM_DATE_DELTA'] = df_2['CLOSE']*len(df_2[df_2['CLOSE'] - df_2['SUBMISSION]]['SUBMISSION'].cumsum()) - df_2[df_2['CLOSE'] - df_2['SUBMISSION]]['SUBMISSION'].cumsum()
CodePudding user response:
I'm not whose calculations are correct. If mine are, let me know if this is what you want.
For 'REF_KEY' == 1, you can use this to find the cumulative sum of days
(df['CLOSE'][0] - df['SUBMISSION']).dt.days.clip(lower=0).sum()
df['CLOSE'][0]
is first close date to calculate against all submission dates; dt.days
gives days in integer
(df['CLOSE'][0] - df['SUBMISSION']).dt.days
0 15
1 2
2 -2
3 -1
4 8
Name: SUBMISSION, dtype: int64
Use clip(lower=0).sum()
to change negative values to zero and sum
(df['CLOSE'][0] - df['SUBMISSION']).dt.days.clip(lower=0).sum()
result = 25
To automate this, use apply()
with a custom function
def calc(x):
# print((x - df['SUBMISSION']).dt.days.clip(lower=0).sum())
return (x - df['SUBMISSION']).dt.days.clip(lower=0).sum()
df
REF_KEY SUBMISSION CLOSE
0 1 2018-08-21 2018-09-05
1 2 2018-09-03 2018-09-12
2 3 2018-09-07 2018-09-18
3 4 2018-09-06 2018-09-24
4 5 2018-08-28 2018-09-27
df['ACCUM_DATE_DELTA'] = df.apply(lambda x: calc(x['CLOSE']), axis=1)
REF_KEY SUBMISSION CLOSE ACCUM_DATE_DELTA
0 1 2018-08-21 2018-09-05 25
1 2 2018-09-03 2018-09-12 57
2 3 2018-09-07 2018-09-18 87
3 4 2018-09-06 2018-09-24 117
4 5 2018-08-28 2018-09-27 132
CodePudding user response:
- Cross-
merge
to generate their cartesian product ofSUBMISSION
xCLOSE
- Keep only the rows
where
CLOSE > SUBMISSION
groupby
theCLOSE
dates and sum the group'sCLOSE - SUBMISSION
daysmerge
theACCUM
values back to the original df
m = pd.merge(df.SUBMISSION, df.CLOSE, how='cross') # cross-merge for all SUBMISSION x CLOSE combos
accum = (m.where(m.CLOSE > m.SUBMISSION) # limit to CLOSE > SUBMISSION
.groupby('CLOSE').SUBMISSION # group by CLOSE
.apply(lambda g: (g.name - g).sum()) # sum of all (CLOSE - SUBMISSION)
.rename('ACCUM'))
df.merge(accum, on='CLOSE') # merge back to df
Output:
REF_KEY SUBMISSION CLOSE ACCUM
0 1 2018-08-21 2018-09-05 25 days
1 2 2018-09-03 2018-09-12 57 days
2 3 2018-09-07 2018-09-18 87 days
3 4 2018-09-06 2018-09-24 117 days
4 5 2018-08-28 2018-09-27 132 days
Notes:
how='cross'
requires pandas 1.2.0 , so for earlier versions,merge
on a dummykey
column:m = df[['SUBMISSION']].assign(key=0).merge(df[['CLOSE']].assign(key=0), on='key').drop(columns='key')
As with Jonathan's solution, a couple of these days are off by 1 compared to your output.