I need your help/guidance with this particular problem. I am dealing with a dataframe like this:
datetime high low
1 2022-08-26 19:00:00+00:00 NaN 0.99564
6 2022-08-26 14:00:00+00:00 1.00902 NaN
9 2022-08-26 11:00:00+00:00 NaN 0.99860
10 2022-08-26 10:00:00+00:00 1.00238 NaN
14 2022-08-26 06:00:00+00:00 NaN 0.99466
17 2022-08-26 03:00:00+00:00 0.99748 NaN
22 2022-08-25 22:00:00+00:00 0.99772 NaN
25 2022-08-25 19:00:00+00:00 0.99790 NaN
27 2022-08-25 17:00:00+00:00 0.99752 NaN
28 2022-08-25 16:00:00+00:00 NaN 0.99492
30 2022-08-25 14:00:00+00:00 1.00006 NaN
31 2022-08-25 13:00:00+00:00 NaN 0.99555
38 2022-08-25 06:00:00+00:00 1.00336 NaN
40 2022-08-25 04:00:00+00:00 1.00088 NaN
43 2022-08-25 01:00:00+00:00 0.99929 NaN
44 2022-08-25 00:00:00+00:00 NaN 0.99632
47 2022-08-24 21:00:00+00:00 NaN 0.99636
The desired output looks like this:
datetime high low
1 2022-08-26 19:00:00+00:00 NaN 0.99564
6 2022-08-26 14:00:00+00:00 1.00902 NaN
9 2022-08-26 11:00:00+00:00 NaN 0.99860
10 2022-08-26 10:00:00+00:00 1.00238 NaN
14 2022-08-26 06:00:00+00:00 NaN 0.99466
25 2022-08-25 19:00:00+00:00 0.99790 NaN
28 2022-08-25 16:00:00+00:00 NaN 0.99492
30 2022-08-25 14:00:00+00:00 1.00006 NaN
31 2022-08-25 13:00:00+00:00 NaN 0.99555
38 2022-08-25 06:00:00+00:00 1.00336 NaN
44 2022-08-25 00:00:00+00:00 NaN 0.99632
What I want is:
For each "patch" of consecutive non-NaN values in "high" or "low" (i.e. the run of values between two NaN rows), I want to keep only the row holding that patch's max/min. So I can't just take e.g. the max of all non-NaN "high" values; that would just return the extreme of the whole dataframe. What I want is the extreme value between two NaN rows.
I was thinking about generating a list for each patch --> finding its extreme --> building a new dataframe afterwards. But this seems pretty "dirty". Is there a nicer way, using pandas, to solve this kind of problem?
I hope my problem is clear.
Thank you guys.
CodePudding user response:
df
###
id datetime high low
0 1 2022-08-26 19:00:00+00:00 NaN 0.99564
1 6 2022-08-26 14:00:00+00:00 1.00902 NaN
2 9 2022-08-26 11:00:00+00:00 NaN 0.99860
3 10 2022-08-26 10:00:00+00:00 1.00238 NaN
4 14 2022-08-26 06:00:00+00:00 NaN 0.99466
5 17 2022-08-26 03:00:00+00:00 0.99748 NaN
6 22 2022-08-25 22:00:00+00:00 0.99772 NaN
7 25 2022-08-25 19:00:00+00:00 0.99790 NaN
8 27 2022-08-25 17:00:00+00:00 0.99752 NaN
9 28 2022-08-25 16:00:00+00:00 NaN 0.99492
10 30 2022-08-25 14:00:00+00:00 1.00006 NaN
11 31 2022-08-25 13:00:00+00:00 NaN 0.99555
12 38 2022-08-25 06:00:00+00:00 1.00336 NaN
13 40 2022-08-25 04:00:00+00:00 1.00088 NaN
14 43 2022-08-25 01:00:00+00:00 0.99929 NaN
15 44 2022-08-25 00:00:00+00:00 NaN 0.99632
16 47 2022-08-24 21:00:00+00:00 NaN 0.99636
# Label each patch: the counter increments at every NaN, so the NaN row and
# the run of values that follows it share one group label.
high_group = df['high'].isna().cumsum()
low_group = df['low'].isna().cumsum()
# Broadcast each patch's extreme back onto its rows; transform('max')/('min')
# ignores NaN, and ffill covers groups that contain only a NaN row.
g_high = df.groupby(high_group)['high'].transform('max').ffill()
g_low = df.groupby(low_group)['low'].transform('min').ffill()
# Drop rows whose value is present but is not the extreme of its patch.
mask = (df['high'].notna() & df['high'].ne(g_high))
mask2 = (df['low'].notna() & df['low'].ne(g_low))
df[~(mask | mask2)]
###
id datetime high low
0 1 2022-08-26 19:00:00+00:00 NaN 0.99564
1 6 2022-08-26 14:00:00+00:00 1.00902 NaN
2 9 2022-08-26 11:00:00+00:00 NaN 0.99860
3 10 2022-08-26 10:00:00+00:00 1.00238 NaN
4 14 2022-08-26 06:00:00+00:00 NaN 0.99466
7 25 2022-08-25 19:00:00+00:00 0.99790 NaN
9 28 2022-08-25 16:00:00+00:00 NaN 0.99492
10 30 2022-08-25 14:00:00+00:00 1.00006 NaN
11 31 2022-08-25 13:00:00+00:00 NaN 0.99555
12 38 2022-08-25 06:00:00+00:00 1.00336 NaN
15 44 2022-08-25 00:00:00+00:00 NaN 0.99632
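For anyone who wants to run this end to end, here is a self-contained sketch of the same approach. The values are rebuilt from the table in the question (the datetime column is omitted for brevity, since the grouping only depends on the NaN pattern), and `keep_patch_extremes` is just a hypothetical name wrapping the steps above:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question; NaN rows mark patch boundaries.
df = pd.DataFrame({
    "id": [1, 6, 9, 10, 14, 17, 22, 25, 27, 28, 30, 31, 38, 40, 43, 44, 47],
    "high": [np.nan, 1.00902, np.nan, 1.00238, np.nan, 0.99748, 0.99772,
             0.99790, 0.99752, np.nan, 1.00006, np.nan, 1.00336, 1.00088,
             0.99929, np.nan, np.nan],
    "low": [0.99564, np.nan, 0.99860, np.nan, 0.99466, np.nan, np.nan,
            np.nan, np.nan, 0.99492, np.nan, 0.99555, np.nan, np.nan,
            np.nan, 0.99632, 0.99636],
})

def keep_patch_extremes(df: pd.DataFrame) -> pd.DataFrame:
    """Keep NaN rows, plus only the max of each 'high' patch and the
    min of each 'low' patch (a patch = a run of values between NaNs)."""
    # isna().cumsum() gives every patch its own group label.
    high_group = df["high"].isna().cumsum()
    low_group = df["low"].isna().cumsum()
    # Broadcast each patch's extreme onto all of its rows.
    g_high = df.groupby(high_group)["high"].transform("max").ffill()
    g_low = df.groupby(low_group)["low"].transform("min").ffill()
    # A row is dropped if its value is present but not the patch extreme.
    drop_high = df["high"].notna() & df["high"].ne(g_high)
    drop_low = df["low"].notna() & df["low"].ne(g_low)
    return df[~(drop_high | drop_low)]

out = keep_patch_extremes(df)
print(out)
```

Running this keeps exactly the ids shown in the desired output (1, 6, 9, 10, 14, 25, 28, 30, 31, 38, 44); e.g. in the 'high' patch 17/22/25/27 only id 25 (0.99790, the patch max) survives.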