I have a pandas dataframe like this:
first second third
0 2 2 False
1 3 1 True
2 1 4 False
3 0 6 False
4 5 7 True
5 4 2 False
6 3 4 False
7 3 6 True
and it could be created with the code:
import pandas as pd
df = pd.DataFrame(
{
'first': [2, 3, 1, 0, 5, 4, 3, 3],
'second': [2, 1, 4, 6, 7, 2, 4, 6],
'third': [False, True, False, False, True, False, False, True]
}
)
For any row with a True
value in the third column, I want to find the first row in the next rows which has a value in the second column greater than the value in the first column.
So the output should be:
first second third
2 1 4 False
6 3 4 False
And also it is my priority not to use any for-loop.
Have you any idea about this?
CodePudding user response:
You can try
m = df['third'].cumsum()
out = (df[m.gt(0) & (~df['third'])] # filter out heading False row and the middle True row
.groupby(m, as_index=False)
# select the first row that value in the second column greater than in the first column
.apply(lambda g: g[g['second'].gt(g['first'])].iloc[:1]))
print(out)
first second third
0 1 4 False
1 3 4 False
CodePudding user response:
One approach, using numpy.searchsorted
(for performance):
import numpy as np
# find all indices where first < second
m = df.index[df['first'] < df['second']]
# find all indices, but the last, where third is True
r = df.index[:-1][df.iloc[:-1]['third']]
# use searchsorted to find in O(logn) the next row where first < second
res = df.iloc[m[np.searchsorted(m, r, side="right")]]
print(res)
Output
first second third
2 1 4 False
6 3 4 False