I need your help/guidance with this particular problem. I am dealing with a dataframe like this:
datetime high low
1 2022-08-26 19:00:00+00:00 NaN 0.99564
6 2022-08-26 14:00:00+00:00 1.00902 NaN
9 2022-08-26 11:00:00+00:00 NaN 0.99860
10 2022-08-26 10:00:00+00:00 1.00238 NaN
14 2022-08-26 06:00:00+00:00 NaN 0.99466
17 2022-08-26 03:00:00+00:00 0.99748 NaN
22 2022-08-25 22:00:00+00:00 0.99772 NaN
25 2022-08-25 19:00:00+00:00 0.99790 NaN
27 2022-08-25 17:00:00+00:00 0.99752 NaN
28 2022-08-25 16:00:00+00:00 NaN 0.99492
30 2022-08-25 14:00:00+00:00 1.00006 NaN
31 2022-08-25 13:00:00+00:00 NaN 0.99555
38 2022-08-25 06:00:00+00:00 1.00336 NaN
40 2022-08-25 04:00:00+00:00 1.00088 NaN
43 2022-08-25 01:00:00+00:00 0.99929 NaN
44 2022-08-25 00:00:00+00:00 NaN 0.99632
47 2022-08-24 21:00:00+00:00 NaN 0.99636
The desired output looks like this:
datetime high low
1 2022-08-26 19:00:00+00:00 NaN 0.99564
6 2022-08-26 14:00:00+00:00 1.00902 NaN
9 2022-08-26 11:00:00+00:00 NaN 0.99860
10 2022-08-26 10:00:00+00:00 1.00238 NaN
14 2022-08-26 06:00:00+00:00 NaN 0.99466
25 2022-08-25 19:00:00+00:00 0.99790 NaN
28 2022-08-25 16:00:00+00:00 NaN 0.99492
30 2022-08-25 14:00:00+00:00 1.00006 NaN
31 2022-08-25 13:00:00+00:00 NaN 0.99555
38 2022-08-25 06:00:00+00:00 1.00336 NaN
44 2022-08-25 00:00:00+00:00 NaN 0.99632
What I want is:
For each "patch" of consecutive non-NaN values in "high" or "low" (i.e. the run of values between two NaN rows), I want to keep only the row holding that patch's max/min. So I can't just take e.g. the max of all non-NaN "high" values; that would just return the extreme of the whole dataframe. What I want is the extreme value between two NaN rows.
I was thinking about generating a list for each patch --> finding its extreme --> building a new dataframe afterwards. But this seems pretty "dirty". Is there a nicer way, using pandas, to solve this kind of problem?
I hope my problem is clear.
Thank you guys.
CodePudding user response:
df
###
id datetime high low
0 1 2022-08-26 19:00:00+00:00 NaN 0.99564
1 6 2022-08-26 14:00:00+00:00 1.00902 NaN
2 9 2022-08-26 11:00:00+00:00 NaN 0.99860
3 10 2022-08-26 10:00:00+00:00 1.00238 NaN
4 14 2022-08-26 06:00:00+00:00 NaN 0.99466
5 17 2022-08-26 03:00:00+00:00 0.99748 NaN
6 22 2022-08-25 22:00:00+00:00 0.99772 NaN
7 25 2022-08-25 19:00:00+00:00 0.99790 NaN
8 27 2022-08-25 17:00:00+00:00 0.99752 NaN
9 28 2022-08-25 16:00:00+00:00 NaN 0.99492
10 30 2022-08-25 14:00:00+00:00 1.00006 NaN
11 31 2022-08-25 13:00:00+00:00 NaN 0.99555
12 38 2022-08-25 06:00:00+00:00 1.00336 NaN
13 40 2022-08-25 04:00:00+00:00 1.00088 NaN
14 43 2022-08-25 01:00:00+00:00 0.99929 NaN
15 44 2022-08-25 00:00:00+00:00 NaN 0.99632
16 47 2022-08-24 21:00:00+00:00 NaN 0.99636
# Label each patch: the counter increments at every NaN, so the NaN row and
# the run of values that follows it share one group label.
high_group = df['high'].isna().cumsum()
low_group = df['low'].isna().cumsum()
# Broadcast each patch's extreme back onto its rows; transform('max')/('min')
# ignores NaN, and ffill covers groups that contain only a NaN row.
g_high = df.groupby(high_group)['high'].transform('max').ffill()
g_low = df.groupby(low_group)['low'].transform('min').ffill()
# Drop rows whose value is present but is not the extreme of its patch.
mask = (df['high'].notna() & df['high'].ne(g_high))
mask2 = (df['low'].notna() & df['low'].ne(g_low))
df[~(mask | mask2)]
###
id datetime high low
0 1 2022-08-26 19:00:00+00:00 NaN 0.99564
1 6 2022-08-26 14:00:00+00:00 1.00902 NaN
2 9 2022-08-26 11:00:00+00:00 NaN 0.99860
3 10 2022-08-26 10:00:00+00:00 1.00238 NaN
4 14 2022-08-26 06:00:00+00:00 NaN 0.99466
7 25 2022-08-25 19:00:00+00:00 0.99790 NaN
9 28 2022-08-25 16:00:00+00:00 NaN 0.99492
10 30 2022-08-25 14:00:00+00:00 1.00006 NaN
11 31 2022-08-25 13:00:00+00:00 NaN 0.99555
12 38 2022-08-25 06:00:00+00:00 1.00336 NaN
15 44 2022-08-25 00:00:00+00:00 NaN 0.99632
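For anyone who wants to run this end to end, here is a self-contained sketch of the same approach. The values are rebuilt from the table in the question (the datetime column is omitted for brevity, since the grouping only depends on the NaN pattern), and `keep_patch_extremes` is just a hypothetical name wrapping the steps above:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question; NaN rows mark patch boundaries.
df = pd.DataFrame({
    "id": [1, 6, 9, 10, 14, 17, 22, 25, 27, 28, 30, 31, 38, 40, 43, 44, 47],
    "high": [np.nan, 1.00902, np.nan, 1.00238, np.nan, 0.99748, 0.99772,
             0.99790, 0.99752, np.nan, 1.00006, np.nan, 1.00336, 1.00088,
             0.99929, np.nan, np.nan],
    "low": [0.99564, np.nan, 0.99860, np.nan, 0.99466, np.nan, np.nan,
            np.nan, np.nan, 0.99492, np.nan, 0.99555, np.nan, np.nan,
            np.nan, 0.99632, 0.99636],
})

def keep_patch_extremes(df: pd.DataFrame) -> pd.DataFrame:
    """Keep NaN rows, plus only the max of each 'high' patch and the
    min of each 'low' patch (a patch = a run of values between NaNs)."""
    # isna().cumsum() gives every patch its own group label.
    high_group = df["high"].isna().cumsum()
    low_group = df["low"].isna().cumsum()
    # Broadcast each patch's extreme onto all of its rows.
    g_high = df.groupby(high_group)["high"].transform("max").ffill()
    g_low = df.groupby(low_group)["low"].transform("min").ffill()
    # A row is dropped if its value is present but not the patch extreme.
    drop_high = df["high"].notna() & df["high"].ne(g_high)
    drop_low = df["low"].notna() & df["low"].ne(g_low)
    return df[~(drop_high | drop_low)]

out = keep_patch_extremes(df)
print(out)
```

Running this keeps exactly the ids shown in the desired output (1, 6, 9, 10, 14, 25, 28, 30, 31, 38, 44); e.g. in the 'high' patch 17/22/25/27 only id 25 (0.99790, the patch max) survives.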