Home > database >  Find max and min value in a df col between two dates contained in other cols
Find max and min value in a df col between two dates contained in other cols

Time:09-02

I have a pandas df (first 25 rows).

{'startDate': {Timestamp('2007-07-02 00:00:00'): NaT, Timestamp('2007-07-03 00:00:00'): NaT, Timestamp('2007-07-05 00:00:00'): Timestamp('2007-07-05 00:00:00'), Timestamp('2007-07-06 00:00:00'): NaT, Timestamp('2007-07-09 00:00:00'): NaT, Timestamp('2007-07-10 00:00:00'): NaT, Timestamp('2007-07-11 00:00:00'): NaT, Timestamp('2007-07-12 00:00:00'): Timestamp('2007-07-12 00:00:00'), Timestamp('2007-07-13 00:00:00'): NaT, Timestamp('2007-07-16 00:00:00'): NaT, Timestamp('2007-07-17 00:00:00'): Timestamp('2007-07-17 00:00:00'), Timestamp('2007-07-18 00:00:00'): NaT, Timestamp('2007-07-19 00:00:00'): NaT, Timestamp('2007-07-20 00:00:00'): NaT, Timestamp('2007-07-23 00:00:00'): NaT, Timestamp('2007-07-24 00:00:00'): NaT, Timestamp('2007-07-25 00:00:00'): NaT, Timestamp('2007-07-26 00:00:00'): NaT, Timestamp('2007-07-27 00:00:00'): NaT, Timestamp('2007-07-30 00:00:00'): NaT, Timestamp('2007-07-31 00:00:00'): NaT}, 'endDate': {Timestamp('2007-07-02 00:00:00'): NaT, Timestamp('2007-07-03 00:00:00'): NaT, Timestamp('2007-07-05 00:00:00'): Timestamp('2007-07-16 00:00:00'), Timestamp('2007-07-06 00:00:00'): NaT, Timestamp('2007-07-09 00:00:00'): NaT, Timestamp('2007-07-10 00:00:00'): NaT, Timestamp('2007-07-11 00:00:00'): NaT, Timestamp('2007-07-12 00:00:00'): Timestamp('2007-07-25 00:00:00'), Timestamp('2007-07-13 00:00:00'): NaT, Timestamp('2007-07-16 00:00:00'): NaT, Timestamp('2007-07-17 00:00:00'): Timestamp('2007-07-30 00:00:00'), Timestamp('2007-07-18 00:00:00'): NaT, Timestamp('2007-07-19 00:00:00'): NaT, Timestamp('2007-07-20 00:00:00'): NaT, Timestamp('2007-07-23 00:00:00'): NaT, Timestamp('2007-07-24 00:00:00'): NaT, Timestamp('2007-07-25 00:00:00'): NaT, Timestamp('2007-07-26 00:00:00'): NaT, Timestamp('2007-07-27 00:00:00'): NaT, Timestamp('2007-07-30 00:00:00'): NaT, Timestamp('2007-07-31 00:00:00'): NaT}, 'Value1': {Timestamp('2007-07-02 00:00:00'): 112.9156, Timestamp('2007-07-03 00:00:00'): 113.3248, Timestamp('2007-07-05 00:00:00'): 113.2057, Timestamp('2007-07-06 00:00:00'): 113.8009, Timestamp('2007-07-09 00:00:00'): 113.8157, Timestamp('2007-07-10 00:00:00'): 112.2684, Timestamp('2007-07-11 00:00:00'): 113.0644, Timestamp('2007-07-12 00:00:00'): 114.8497, Timestamp('2007-07-13 00:00:00'): 115.1919, Timestamp('2007-07-16 00:00:00'): 115.1771, Timestamp('2007-07-17 00:00:00'): 115.1175, Timestamp('2007-07-18 00:00:00'): 114.9093, Timestamp('2007-07-19 00:00:00'): 115.3556, Timestamp('2007-07-20 00:00:00'): 114.1877, Timestamp('2007-07-23 00:00:00'): 114.5373, Timestamp('2007-07-24 00:00:00'): 112.5511, Timestamp('2007-07-25 00:00:00'): 112.7817, Timestamp('2007-07-26 00:00:00'): 110.1111, Timestamp('2007-07-27 00:00:00'): 107.9464, Timestamp('2007-07-30 00:00:00'): 109.6351, Timestamp('2007-07-31 00:00:00'): 108.4002}}

The index, 'Date', is dt and is continuous daily data. I have two other dt cols, 'startDate' and 'endDate'. For each row in 'startDate' and 'endDate' that is != NaT I need to find the max and min of 'Value1' between startDate and endDate. Each 'startDate' and 'endDate' is in the index.

I cannot delete NaT's or I lose rows that could contain the max/min.

How can I do this?

enter image description here

CodePudding user response:

Try:

df.apply(lambda row: df.loc[row['startDate'] : row['endDate'], 'Value1'].max(), 
         axis=1)

For min - just replace max with min.

CodePudding user response:

I answered my own question. Apologize for all the text and formatting but maybe helpful to someone:

  1. Keep the event-based startDate and endDate data in a seperate df ('df2') from the df ('df') containing continuous data from which you want to search for max/min.

  2. Perform searchsorted and get the corresponding df2 values for both endpoints in df. s1 = dfT.reindex(np.searchsorted(df2['entryDate'], df['Date'], side='right')-1) s2 = dfT.reindex(np.searchsorted(df2['exitDate'], df['Date'], side='right')-1)
    This produces a unique index per event but spread across all rows in df.

  3. Drop indexes in s1 and s2 and concat to get both event indices together. s1 = s1.reset_index() s1 = pd.concat([s1, s2], axis=1)

  4. Determine where the indices are the same. These are dates in the continuous data outside of event startDate-endDate windows and should be removed. s1['sameIDX'] = np.where((s1['entryIdx'] == s1['exitIdx']), 1, 0) s1 = s1[(s1['sameIDX'] == 0)]

  5. Concat s1 with the continuous data (df).

  6. Groupby 'uniqueIdx' to get max/mins lows = s1.groupby('uniqueIdx').min()['Value1'] highs = s1.groupby('uniqueIdx').max()['Value1']

  7. Join with event df to add max/mins to events. df2 = pd.concat([df2, lows], axis=1) df2 = pd.concat([df2, highs], axis=1)

  • Related