Home > Software engineering >  How to find the columns that contain consecutive values in a pandas DataFrame?
How to find the columns that contain consecutive values in a pandas DataFrame?

Time:09-29

I have a pandas DataFrame like this

import pandas as pd
import numpy as np

data = {
    'col1': [0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, np.nan, np.nan, np.nan],
    'col2': [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    'col3': [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
}

df = pd.DataFrame(data)

print(df)
#     col1  col2  col3
# 0    0.0   1.0   1.0
# 1    1.0   1.0   1.0
# 2    1.0   1.0   0.0
# 3    1.0   1.0   1.0
# 4    1.0   1.0   1.0
# 5    0.0   0.0   0.0
# 6    1.0   0.0   0.0
# 7    1.0   0.0   0.0
# 8    0.0   0.0   0.0
# 9    0.0   1.0   1.0
# 10   0.0   1.0   1.0
# 11   NaN   1.0   0.0
# 12   NaN   1.0   1.0
# 13   NaN   1.0   1.0

How can I find the columns that have 4 or more consecutive 1.0?

In my example, col1 and col2 are what I want to find.

Because df['col1'][1:5] contains 4 1.0s and df['col2'][:5] contains 5 1.0s.

CodePudding user response:

If large DataFrame you can avoid groupby and use this solution for count consecutive 1 values, last filter columns names:

m = df.eq(1)
b = m.cumsum()
s = b.sub(b.mask(m).ffill().fillna(0)).ge(4).any()
print (s)
col1     True
col2     True
col3    False
dtype: bool


out = s.index[s].tolist()
print (out)
['col1', 'col2']

Or:

df1 = df.loc[:, s]
print (df1)
    col1  col2
0    0.0   1.0
1    1.0   1.0
2    1.0   1.0
3    1.0   1.0
4    1.0   1.0
5    0.0   0.0
6    1.0   0.0
7    1.0   0.0
8    0.0   0.0
9    0.0   1.0
10   0.0   1.0
11   NaN   1.0
12   NaN   1.0
13   NaN   1.0

Details:

print (b.sub(b.mask(m).ffill().fillna(0)))
    col1  col2  col3
0    0.0   1.0   1.0
1    1.0   2.0   2.0
2    2.0   3.0   0.0
3    3.0   4.0   1.0
4    4.0   5.0   2.0
5    0.0   0.0   0.0
6    1.0   0.0   0.0
7    2.0   0.0   0.0
8    0.0   0.0   0.0
9    0.0   1.0   1.0
10   0.0   2.0   2.0
11   0.0   3.0   0.0
12   0.0   4.0   1.0
13   0.0   5.0   2.0

CodePudding user response:

I would use a custom function to determine the max number of consecutive values:

def max_consecutive(s):
    # group consecutive 1s and get the max size
    return s.groupby(s.ne(1).cumsum()).size().max()-1

df.loc[:, df.apply(max_consecutive).ge(4)]

output:

    col1  col2
0    0.0   1.0
1    1.0   1.0
2    1.0   1.0
3    1.0   1.0
4    1.0   1.0
5    0.0   0.0
6    1.0   0.0
7    1.0   0.0
8    0.0   0.0
9    0.0   1.0
10   0.0   1.0
11   NaN   1.0
12   NaN   1.0
13   NaN   1.0

If you just want the indices:

m = df.apply(max_consecutive).ge(4)
m[m].index

output:

['col1', 'col2']

Intermediate:

df.apply(max_consecutive)

output:

col1    4
col2    5
col3    2
dtype: int64
  • Related