Checking the length of a part of a dataframe in conditional row selection in pandas-CodePudding

Suppose I have a pandas dataframe like this:

    first   second  third
1     2       2       1
2     2       1       0
3     3       4       5
4     4       6       3
5     5       4       3
6     8       8       4
7     3       4       2
8     5       6       6

and could be created with the code:

dataframe = pd.DataFrame(
    {
        'first': [2, 2, 3, 4, 5, 8, 3, 5], 
        'second': [2, 1, 4, 6, 4, 8, 4, 6], 
        'third': [1, 0, 5, 3, 3, 4, 2, 6]
    }
)

I want to select the rows in which the value of the second column is more than the value of the first column and at the same time the values in the third column are less than the values in the second column for k consecutive rows where the last row of these k consecutive rows is exactly before the row in which the value of the second column is more than the value of the first column, and k could be any integer between 2 and 4 (closed interval).

So, the output should be rows: 3, 7, 8

To get the above-mentioned result using conditional row selection in pandas, I know I should write a code like this:

dataframe[(dataframe['first'] < dataframe['second']) & (second_condition)].index

But I don't know what to write for the second_condition which I have explained above. Can anyone help me with this?

CodePudding user response：

I will center my answer in the second part of your question. You need to use shift function to compare. It allows you to shift rows.

Assuming your k is fixed at 2, you should do something like this:

import pandas as pd

df = pd.DataFrame(
    {
        'first': [2, 2, 3, 4, 5, 8, 3, 5], 
        'second': [2, 1, 4, 6, 4, 8, 4, 6], 
        'third': [1, 0, 5, 3, 3, 4, 2, 6]
    }
)

# this is the line
df[(df['third'] < df['second'].shift(1)) & (df['third'] < df['second'].shift(2))]

What's going on?

Start comparing 'third' with previous value of 'second' by shifting one row, and then shift it two places in a second condition.

Note this only works for fixed values of k.

CodePudding user response：

The trick here is to calculate the rolling sum on a boolean mask to find out the number of values in k previous rows where third column is less than the second column

k = 2
m1 = df['second'].gt(df['first'])
m2 = df['third'].lt(df['second']).shift(fill_value=0).rolling(k).sum().eq(k)

print(df[m1 & m2])

   first  second  third
3      3       4      5
7      3       4      2
8      5       6      6