I was looking at this question: How can I find 5 consecutive rows in pandas Dataframe where a value of a certain column is at least 0.5, which is similar to the one I have in mind. I would like to find say at least 3 consecutive rows where a value is less than 0.5 (but not negative nor nan), while considering the entire dataframe and not just one column as in the question linked above. Here a facsimile dataframe:
from random import uniform
idx = pd.date_range("2018-01-01", periods=10, freq="M")
df = pd.DataFrame(
{
'A':[0, 0.4, 0.5, 0.3, 0,0,0,0,0,0],
'B':[0, 0.6, 0.8,0, 0.3, 0.3, 0.9, 0.7,0,0],
'C':[0,0,0.5, 0.4, 0.4, 0.2,0,0,0,0],
'D':[0.4,0, 0.6, 0.5, 0.7, 0.2,0, 0.9, 0.8,0],
'E':[0.4, 0.3, 0.2, 0.7, 0.7, 0.8,0,0,0,0],
'F':[0,0,0.6, 0.7,0.8, 0.3, 0.4, 0.1,0,0]
},
index=idx
)
df = df.replace({0:np.nan})
df
Hence, since columns B and D don't satisfy the criteria should be removed from the output.
I'd prefer not to use for loop and the like since it is a 2000-column df, therefore I tried with the following:
def consecutive_values_in_range(s, min, max):
return s.between(left=min, right=max)
min, max = 0, 0.5
df.apply(lambda col: consecutive_values_in_range(col, min, max), axis=0)
print(df)
But I didn't obtain what I was looking for, that would be something like this:
A C E F
2018-01-31 NaN NaN 0.4 NaN
2018-02-28 0.4 NaN 0.3 NaN
2018-03-31 0.5 0.5 0.2 0.6
2018-04-30 0.3 0.4 0.7 0.7
2018-05-31 NaN 0.4 0.7 0.8
2018-06-30 NaN 0.2 0.8 0.3
2018-07-31 NaN NaN NaN 0.4
2018-08-31 NaN NaN NaN 0.1
2018-09-30 NaN NaN NaN NaN
2018-10-31 NaN NaN NaN NaN
Any suggestions? Thanks in advance.
CodePudding user response:
lower, upper = 0, 0.5
n = 3
df.loc[:, ((df <= upper) & (df >= lower)).rolling(n).sum().eq(n).any()]
- get an is_between mask over
df
- get the rolling sum of these masks per column, window size being 3
- since True == 1 and False == 0, if we get 3 in any point, that implies consecutive 3 True's, i.e., 0 <= val <= 0.5 values in that column
- so check equality against 3 and see if there's any in a column
- lastly index with the resulting True/False mask per column
to get
A C E F
2018-01-31 NaN NaN 0.4 NaN
2018-02-28 0.4 NaN 0.3 NaN
2018-03-31 0.5 0.5 0.2 0.6
2018-04-30 0.3 0.4 0.7 0.7
2018-05-31 NaN 0.4 0.7 0.8
2018-06-30 NaN 0.2 0.8 0.3
2018-07-31 NaN NaN NaN 0.4
2018-08-31 NaN NaN NaN 0.1
2018-09-30 NaN NaN NaN NaN
2018-10-31 NaN NaN NaN NaN