I have a pandas DataFrame like this:
import pandas as pd
import numpy as np
data = {
'col1': [0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, np.nan, np.nan, np.nan],
'col2': [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0],
'col3': [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
}
df = pd.DataFrame(data)
print(df)
# col1 col2 col3
# 0 0.0 1.0 1.0
# 1 1.0 1.0 1.0
# 2 1.0 1.0 0.0
# 3 1.0 1.0 1.0
# 4 1.0 1.0 1.0
# 5 0.0 0.0 0.0
# 6 1.0 0.0 0.0
# 7 1.0 0.0 0.0
# 8 0.0 0.0 0.0
# 9 0.0 1.0 1.0
# 10 0.0 1.0 1.0
# 11 NaN 1.0 0.0
# 12 NaN 1.0 1.0
# 13 NaN 1.0 1.0
How can I find the columns that have 4 or more consecutive 1.0 values?
In my example, col1 and col2 are what I want to find, because df['col1'][1:5] contains four 1.0s and df['col2'][:5] contains five 1.0s.
CodePudding user response:
For a large DataFrame you can avoid groupby and use this solution to count consecutive 1 values, then filter the column names:
m = df.eq(1)
b = m.cumsum()
s = b.sub(b.mask(m).ffill().fillna(0)).ge(4).any()
print (s)
col1 True
col2 True
col3 False
dtype: bool
out = s.index[s].tolist()
print (out)
['col1', 'col2']
Or:
df1 = df.loc[:, s]
print (df1)
col1 col2
0 0.0 1.0
1 1.0 1.0
2 1.0 1.0
3 1.0 1.0
4 1.0 1.0
5 0.0 0.0
6 1.0 0.0
7 1.0 0.0
8 0.0 0.0
9 0.0 1.0
10 0.0 1.0
11 NaN 1.0
12 NaN 1.0
13 NaN 1.0
Details:
print (b.sub(b.mask(m).ffill().fillna(0)))
col1 col2 col3
0 0.0 1.0 1.0
1 1.0 2.0 2.0
2 2.0 3.0 0.0
3 3.0 4.0 1.0
4 4.0 5.0 2.0
5 0.0 0.0 0.0
6 1.0 0.0 0.0
7 2.0 0.0 0.0
8 0.0 0.0 0.0
9 0.0 1.0 1.0
10 0.0 2.0 2.0
11 0.0 3.0 0.0
12 0.0 4.0 1.0
13 0.0 5.0 2.0
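The steps above can be wrapped into a small helper, which may make the cumulative-sum reset easier to follow. This is only a sketch: the function name columns_with_run and its parameters value and min_len are made up for illustration and are not part of the answer.

```python
import numpy as np
import pandas as pd

data = {
    'col1': [0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, np.nan, np.nan, np.nan],
    'col2': [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    'col3': [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0],
}
df = pd.DataFrame(data)


def columns_with_run(df, value=1, min_len=4):
    """Hypothetical helper: names of columns holding at least
    `min_len` consecutive cells equal to `value`."""
    m = df.eq(value)   # True where the cell matches (NaN never matches)
    b = m.cumsum()     # running count of matches, per column
    # Keep the running count only on non-matching rows and carry it forward:
    # this is the count reached before the current run started.
    baseline = b.mask(m).ffill().fillna(0)
    run_len = b.sub(baseline)   # length of the current run at each row
    s = run_len.ge(min_len).any()
    return s.index[s].tolist()


print(columns_with_run(df))             # ['col1', 'col2']
print(columns_with_run(df, min_len=5))  # ['col2']
```

Subtracting the carried-forward baseline is what resets the counter to zero on every non-matching row, so no groupby is needed.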
CodePudding user response:
I would use a custom function to determine the max number of consecutive values:
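def max_consecutive(s):
    # split the column into blocks at every non-1 value, count the 1s
    # in each block, and return the largest count (also correct when
    # the column starts with a run of 1s)
    return s.eq(1).groupby(s.ne(1).cumsum()).sum().max()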
df.loc[:, df.apply(max_consecutive).ge(4)]
output:
col1 col2
0 0.0 1.0
1 1.0 1.0
2 1.0 1.0
3 1.0 1.0
4 1.0 1.0
5 0.0 0.0
6 1.0 0.0
7 1.0 0.0
8 0.0 0.0
9 0.0 1.0
10 0.0 1.0
11 NaN 1.0
12 NaN 1.0
13 NaN 1.0
If you just want the column names:
m = df.apply(max_consecutive).ge(4)
m[m].index.tolist()
output:
['col1', 'col2']
Intermediate:
df.apply(max_consecutive)
output:
col1 4
col2 5
col3 2
dtype: int64
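As a cross-check on both answers, a fixed-size rolling window gives the same result when only a yes/no answer per column is needed. This is an alternative technique, not taken from either answer, and the variable names n, hits and cols are just for illustration: a window of n rows contains n consecutive 1s exactly when the sum of its 0/1 flags equals n.

```python
import numpy as np
import pandas as pd

data = {
    'col1': [0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, np.nan, np.nan, np.nan],
    'col2': [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    'col3': [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0],
}
df = pd.DataFrame(data)

n = 4  # required run length

# 1 where the cell equals 1, else 0 (NaN becomes 0 because NaN != 1)
flags = df.eq(1).astype(int)

# A window of n rows sums to n only if every row in it is a 1.
# The first n-1 rows have incomplete windows (NaN) and compare as False.
hits = flags.rolling(n).sum().eq(n).any()

cols = hits.index[hits].tolist()
print(cols)  # ['col1', 'col2']
```

Unlike max_consecutive, this does not report how long the longest run is, only whether a run of at least n exists.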