Home > Blockchain >  Pandas DataFrame - get start and end datetime of longest consecutive duration that mathces condition
Pandas DataFrame - get start and end datetime of longest consecutive duration that mathces condition

Time:08-18

As in the title I try to get the two date times (starting and ending) from the longest consecutive duration that matches the condition. For example in a df1 that looks like this:

                    PlusOrMinus
2021-12-07 00:15:00 -13.5871
2021-12-07 00:30:00 -15.4942
2021-12-07 00:45:00 -20.1757
2021-12-07 01:00:00 -18.7693
2021-12-07 01:15:00 -20.3976
2021-12-07 01:30:00 -23.4302
2021-12-07 01:45:00 -27.7890
2021-12-07 02:00:00 -26.3100
2021-12-07 02:15:00 -24.8172
2021-12-07 02:30:00 -24.6888
2021-12-07 02:45:00 -25.6061
2021-12-07 03:00:00 -15.9453
2021-12-07 03:15:00 -16.2046
2021-12-07 03:30:00 -22.0985
2021-12-07 03:45:00 -31.5712
2021-12-07 04:00:00 -19.7103
2021-12-07 04:15:00 -21.9567
2021-12-07 04:30:00 -18.4576
2021-12-07 04:45:00 7.8169
2021-12-07 05:00:00 -3.1573
2021-12-07 05:15:00 -2.0727
2021-12-07 05:30:00 -9.9998
2021-12-07 05:45:00 -8.3122
2021-12-07 06:00:00 -21.1213
2021-12-07 06:15:00 -15.7010
2021-12-07 06:30:00 -11.9873
2021-12-07 06:45:00 -17.0063
2021-12-07 07:00:00 -19.1622
2021-12-07 07:15:00 -24.9770
2021-12-07 07:30:00 -17.0213
2021-12-07 07:45:00 -11.8568
2021-12-07 08:00:00 -14.5086
2021-12-07 08:15:00 -12.6211
2021-12-07 08:30:00 -15.1892
2021-12-07 08:45:00 -6.3195
2021-12-07 09:00:00 3.8588
2021-12-07 09:15:00 7.0303
2021-12-07 09:30:00 5.4663
2021-12-07 09:45:00 15.4749
2021-12-07 10:00:00 13.6401
2021-12-07 10:15:00 19.3944
2021-12-07 10:30:00 27.7750
2021-12-07 10:45:00 27.4708
2021-12-07 11:00:00 27.3326
2021-12-07 11:15:00 35.0471
2021-12-07 11:30:00 34.9592
2021-12-07 11:45:00 16.2484
2021-12-07 12:00:00 16.3074
2021-12-07 12:15:00 34.0840
2021-12-07 12:30:00 45.9264
2021-12-07 12:45:00 42.7552
2021-12-07 13:00:00 -12.7922
2021-12-07 13:15:00 -2.6878
2021-12-07 13:30:00 44.9213
2021-12-07 13:45:00 59.2340
2021-12-07 14:00:00 41.7356
2021-12-07 14:15:00 22.5459
2021-12-07 14:30:00 21.5620
2021-12-07 14:45:00 2.9136
2021-12-07 15:00:00 2.6749
2021-12-07 15:15:00 4.2463
2021-12-07 15:30:00 6.5702
2021-12-07 15:45:00 20.4180
2021-12-07 16:00:00 -12.0888
2021-12-07 16:15:00 -3.1891
2021-12-07 16:30:00 -0.1187
2021-12-07 16:45:00 -4.9805
2021-12-07 17:00:00 4.0196
2021-12-07 17:15:00 19.7059
2021-12-07 17:30:00 32.7101
2021-12-07 17:45:00 19.4074
2021-12-07 18:00:00 31.3227
2021-12-07 18:15:00 39.5703
2021-12-07 18:30:00 -0.8093
2021-12-07 18:45:00 -29.1126
2021-12-07 19:00:00 -48.4503
2021-12-07 19:15:00 -40.0563
2021-12-07 19:30:00 -12.6288
2021-12-07 19:45:00 15.3634
2021-12-07 20:00:00 29.0306
2021-12-07 20:15:00 12.8147
2021-12-07 20:30:00 13.9144
2021-12-07 20:45:00 20.3090
2021-12-07 21:00:00 10.9778
2021-12-07 21:15:00 -5.4246
2021-12-07 21:30:00 1.6816
2021-12-07 21:45:00 -5.2931
2021-12-07 22:00:00 -9.4514
2021-12-07 22:15:00 -23.1077
2021-12-07 22:30:00 -21.4703
2021-12-07 22:45:00 -8.2436
2021-12-07 23:00:00 2.4746
2021-12-07 23:15:00 0.9520
2021-12-07 23:30:00 3.7781
2021-12-07 23:45:00 13.9193
2021-12-08 00:00:00 9.6299

I want to get two separate variables that store the starting and ending DateTime of the longest consecutive duration where values from the PlusOrMinus column were less than zero (<0). Also, I want to get similar variables but for conditions where values from the PlusOrMinus column were greater than zero (>0). In the case of values that are smaller than 0 in df1 the starting var should be equal to 2021-12-07 00:15:00 and the ending var should be equal to 2021-12-07 04:30:00. Is it possible to achieve with 'pure' pandas? I will be grateful for any help.

CodePudding user response:

IIUC, you can do this:

# 1. make sure index has some name and is a DatetimeIndex
df.index = pd.to_datetime(df.index.rename('date'))

>>> df
                     PlusOrMinus
date                            
2021-12-07 00:15:00     -13.5871
2021-12-07 00:30:00     -15.4942
2021-12-07 00:45:00     -20.1757
...                          ...
2021-12-07 23:30:00       3.7781
2021-12-07 23:45:00      13.9193
2021-12-08 00:00:00       9.6299


# 2. make groups of consecutive same-sign
s = np.sign(df['PlusOrMinus']).astype(int)
z = df.assign(sign=s, grp=(s != s.shift()).cumsum())

>>> z
                     PlusOrMinus  sign  grp
date                                       
2021-12-07 00:15:00     -13.5871    -1    1
2021-12-07 00:30:00     -15.4942    -1    1
2021-12-07 00:45:00     -20.1757    -1    1
...                          ...   ...  ...
2021-12-07 23:30:00       3.7781     1   14
2021-12-07 23:45:00      13.9193     1   14
2021-12-08 00:00:00       9.6299     1   14

Now:

# 3. find duration, then longest for each sign
z2 = z.reset_index().groupby(['sign', 'grp'])['date'].agg([min, max])
z2 = z2.assign(duration=(z2['max'] - z2['min']))
idx = z2.groupby('sign')['duration'].idxmax()
result = z2.loc[idx]

>>> result
                         min                 max        duration
sign grp                                                        
-1   1   2021-12-07 00:15:00 2021-12-07 04:30:00 0 days 04:15:00
 1   4   2021-12-07 09:00:00 2021-12-07 12:45:00 0 days 03:45:00

If you want to further assign the dates min, max for the longest negative and positive runs to individual variables:

neg_start, neg_end = result.loc[-1][['min', 'max']].squeeze()
pos_start, pos_end = result.loc[1][['min', 'max']].squeeze()

print(f'longest negative run starts at {neg_start} and ends at {neg_end}')
print(f'longest positive run starts at {pos_start} and ends at {pos_end}')

# output
longest negative run starts at 2021-12-07 00:15:00 and ends at 2021-12-07 04:30:00
longest positive run starts at 2021-12-07 09:00:00 and ends at 2021-12-07 12:45:00
  • Related