Suppose I have a pandas dataframe like this:
first second third
1 2 2 1
2 2 1 0
3 3 4 5
4 4 6 3
5 5 4 3
6 8 8 4
7 3 4 2
8 5 6 6
and could be created with the code:
dataframe = pd.DataFrame(
{
'first': [2, 2, 3, 4, 5, 8, 3, 5],
'second': [2, 1, 4, 6, 4, 8, 4, 6],
'third': [1, 0, 5, 3, 3, 4, 2, 6]
}
)
I want to select the rows in which the value of the second column is more than the value of the first column and at the same time the values in the third column are less than the values in the second column for k consecutive rows where the last row of these k consecutive rows is exactly before the row in which the value of the second column is more than the value of the first column, and k could be any integer between 2 and 4 (closed interval).
So, the output should be rows:
3, 7, 8
To get the above-mentioned result using conditional row selection in pandas, I know I should write a code like this:
dataframe[(dataframe['first'] < dataframe['second']) & (second_condition)].index
But I don't know what to write for the second_condition which I have explained above. Can anyone help me with this?
CodePudding user response:
I will center my answer in the second part of your question. You need to use shift
function to compare. It allows you to shift rows.
Assuming your k
is fixed at 2, you should do something like this:
import pandas as pd
df = pd.DataFrame(
{
'first': [2, 2, 3, 4, 5, 8, 3, 5],
'second': [2, 1, 4, 6, 4, 8, 4, 6],
'third': [1, 0, 5, 3, 3, 4, 2, 6]
}
)
# this is the line
df[(df['third'] < df['second'].shift(1)) & (df['third'] < df['second'].shift(2))]
What's going on?
Start comparing 'third'
with previous value of 'second'
by shifting one row, and then shift it two places in a second condition.
Note this only works for fixed values of k
.
CodePudding user response:
The trick here is to calculate the rolling
sum
on a boolean mask to find out the number of values in k
previous rows where third
column is less than the second
column
k = 2
m1 = df['second'].gt(df['first'])
m2 = df['third'].lt(df['second']).shift(fill_value=0).rolling(k).sum().eq(k)
print(df[m1 & m2])
first second third
3 3 4 5
7 3 4 2
8 5 6 6