Home > front end >  Find consecutive values in rows in pandas Dataframe based on condition
Find consecutive values in rows in pandas Dataframe based on condition

Time:01-15

I was looking at this question: How can I find 5 consecutive rows in pandas Dataframe where a value of a certain column is at least 0.5, which is similar to the one I have in mind. I would like to find say at least 3 consecutive rows where a value is less than 0.5 (but not negative nor nan), while considering the entire dataframe and not just one column as in the question linked above. Here a facsimile dataframe:

from random import uniform

idx = pd.date_range("2018-01-01", periods=10, freq="M")

df = pd.DataFrame(
    {
        'A':[0, 0.4, 0.5, 0.3, 0,0,0,0,0,0],
        'B':[0, 0.6, 0.8,0, 0.3, 0.3, 0.9, 0.7,0,0],
        'C':[0,0,0.5, 0.4, 0.4, 0.2,0,0,0,0],
        'D':[0.4,0, 0.6, 0.5, 0.7, 0.2,0, 0.9, 0.8,0],
        'E':[0.4, 0.3, 0.2, 0.7, 0.7, 0.8,0,0,0,0],
        'F':[0,0,0.6, 0.7,0.8, 0.3, 0.4, 0.1,0,0]
    },
    index=idx
)

df = df.replace({0:np.nan})

df

Hence, since columns B and D don't satisfy the criteria should be removed from the output.

I'd prefer not to use for loop and the like since it is a 2000-column df, therefore I tried with the following:

def consecutive_values_in_range(s, min, max):

    return s.between(left=min, right=max)

min, max = 0, 0.5

df.apply(lambda col: consecutive_values_in_range(col, min, max), axis=0)

print(df)

But I didn't obtain what I was looking for, that would be something like this:

            A     C   E   F
2018-01-31  NaN NaN 0.4 NaN
2018-02-28  0.4 NaN 0.3 NaN
2018-03-31  0.5 0.5 0.2 0.6
2018-04-30  0.3 0.4 0.7 0.7
2018-05-31  NaN 0.4 0.7 0.8
2018-06-30  NaN 0.2 0.8 0.3
2018-07-31  NaN NaN NaN 0.4
2018-08-31  NaN NaN NaN 0.1
2018-09-30  NaN NaN NaN NaN
2018-10-31  NaN NaN NaN NaN

Any suggestions? Thanks in advance.

CodePudding user response:

lower, upper = 0, 0.5
n = 3
df.loc[:, ((df <= upper) & (df >= lower)).rolling(n).sum().eq(n).any()]
  • get an is_between mask over df
  • get the rolling sum of these masks per column, window size being 3
  • since True == 1 and False == 0, if we get 3 in any point, that implies consecutive 3 True's, i.e., 0 <= val <= 0.5 values in that column
  • so check equality against 3 and see if there's any in a column
  • lastly index with the resulting True/False mask per column

to get

              A    C    E    F
2018-01-31  NaN  NaN  0.4  NaN
2018-02-28  0.4  NaN  0.3  NaN
2018-03-31  0.5  0.5  0.2  0.6
2018-04-30  0.3  0.4  0.7  0.7
2018-05-31  NaN  0.4  0.7  0.8
2018-06-30  NaN  0.2  0.8  0.3
2018-07-31  NaN  NaN  NaN  0.4
2018-08-31  NaN  NaN  NaN  0.1
2018-09-30  NaN  NaN  NaN  NaN
2018-10-31  NaN  NaN  NaN  NaN
  • Related