Home > Software engineering >  count sets of consecutive true values in a column
count sets of consecutive true values in a column

Time:12-15

Let's say that I have a dataframe as follow:

df = pd.DataFrame({'A':[1,1,1,1,1,0,0,1,1,0,1,1,1,1,1,0,0,0,0,0,1,1]})

Then, I convert it into a boolean form:

df.eq(1)
Out[213]: 
        A
0    True
1    True
2    True
3    True
4    True
5   False
6   False
7    True
8    True
9   False
10   True
11   True
12   True
13   True
14   True
15  False
16  False
17  False
18  False
19  False
20   True
21   True

What I want is to count consecutive sets of True values in the column. In this example, the output would be:

    df
Out[215]: 
    A  count
0   1    5.0
1   1    2.0
2   1    5.0
3   1    2.0
4   1    NaN
5   0    NaN
6   0    NaN
7   1    NaN
8   1    NaN
9   0    NaN
10  1    NaN
11  1    NaN
12  1    NaN
13  1    NaN
14  1    NaN
15  0    NaN
16  0    NaN
17  0    NaN
18  0    NaN
19  0    NaN
20  1    NaN
21  1    NaN

My progress has been by using tools as 'groupby' and 'cumsum' but honestly, I can not figure out how to solve it. Thanks in advance

CodePudding user response:

You can use df['A'].diff().ne(0).cumsum() to generate a grouper that will group each consecutive group of zeros/ones:

# A side-by-side comparison:
>>> pd.concat([df['A'], df['A'].diff().ne(0).cumsum()], axis=1)
    A  A
0   1  1
1   1  1
2   1  1
3   1  1
4   1  1
5   0  2
6   0  2
7   1  3
8   1  3
9   0  4
10  1  5
11  1  5
12  1  5
13  1  5
14  1  5
15  0  6
16  0  6
17  0  6
18  0  6
19  0  6
20  1  7
21  1  7

Thus, group by that grouper, calculate sums, replace zero with NaN dropna, and reset the index:

df['count'] = df.groupby(df['A'].diff().ne(0).cumsum()).sum().replace(0, np.nan).dropna().reset_index(drop=True)

Output:

>>> df
    A    B
0   1  5.0
1   1  2.0
2   1  5.0
3   1  2.0
4   1  NaN
5   0  NaN
6   0  NaN
7   1  NaN
8   1  NaN
9   0  NaN
10  1  NaN
11  1  NaN
12  1  NaN
13  1  NaN
14  1  NaN
15  0  NaN
16  0  NaN
17  0  NaN
18  0  NaN
19  0  NaN
20  1  NaN
21  1  NaN

CodePudding user response:

I propose an alternative way that makes use of the split string function.

Let's transform the Series df.A into a string and then split it where the zeros are.

df = pd.DataFrame({'A':[1,1,1,1,1,0,0,1,1,0,1,1,1,1,1,0,0,0,0,0,1,1]})
ll = ''.join(df.A.astype('str').tolist()).split('0')

The list ll looks like

print(ll)
['11111', '', '11', '11111', '', '', '', '', '11']

now we count the lengths of every string and put it into a list

[len(item) for item in ll if len(item)>0]

This is doable if the Series is not too long.

  • Related