Count consecutive numbers from a column of a dataframe in Python

I have a dataframe that has segments of consecutive values appearing in column a (the value in column b does not matter):

import pandas as pd
import numpy as np
np.random.seed(150)
df = pd.DataFrame(data={'a':[1,2,3,4,5,15,16,17,18,203,204,205],'b':np.random.randint(50000,size=(12))})
>>> df
      a      b
0     1  27066
1     2  28155
2     3  49177
3     4    496
4     5   2354
5    15  23292
6    16   9358
7    17  19036
8    18  29946
9   203  39785
10  204  15843
11  205  21917

I would like to add a column c that numbers the rows within each run of consecutive values in column a, restarting at 1 whenever a new run begins, as shown below:

a   b       c
1   27066   1
2   28155   2
3   49177   3
4   496     4
5   2354    5
15  23292   1
16  9358    2
17  19036   3
18  29946   4
203 39785   1
204 15843   2
205 21917   3

How can I do this?

CodePudding user response：

One solution:

df["c"] = (s := df["a"] - np.arange(len(df))).groupby(s).cumcount() + 1
print(df)

Output

      a      b  c
0     1  27066  1
1     2  28155  2
2     3  49177  3
3     4    496  4
4     5   2354  5
5    15  23292  1
6    16   9358  2
7    17  19036  3
8    18  29946  4
9   203  39785  1
10  204  15843  2
11  205  21917  3
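
To see why this works, it can help to look at the grouping key on its own. The sketch below rebuilds only column a from the question (column b is left out since it does not matter) and prints the key; the names key and counts are just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 15, 16, 17, 18, 203, 204, 205]})

# Subtracting the row position from each value gives a constant within a run
key = df["a"] - np.arange(len(df))
print(key.tolist())
# [1, 1, 1, 1, 1, 10, 10, 10, 10, 194, 194, 194]

# Number the rows inside each key group, starting at 1
counts = key.groupby(key).cumcount() + 1
print(counts.tolist())
# [1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 2, 3]
```

Note that this key only separates runs cleanly when column a is increasing, as it is here. With unsorted data, unrelated rows can land on the same key (for example, [1, 5, 3] gives keys 1, 4, 1).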

The original idea comes from ancient Python docs.
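
Assuming the docs example meant here is the old itertools.groupby recipe for finding runs of consecutive numbers, the same idea in plain Python might look roughly like this. The variable names are mine, and this is only a sketch on the values of column a:

```python
from itertools import groupby

a = [1, 2, 3, 4, 5, 15, 16, 17, 18, 203, 204, 205]

c = []
# Group (position, value) pairs by value - position, which is constant within a run
for _, run in groupby(enumerate(a), key=lambda pair: pair[1] - pair[0]):
    # Number the members of each run starting at 1
    c.extend(n for n, _pair in enumerate(run, start=1))

print(c)
# [1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 2, 3]
```

Unlike the pandas groupby, itertools.groupby only groups adjacent items, so two separate runs that happen to share a key are not merged.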

To use the walrus operator (:=, also called an assignment expression) you need Python 3.8 or later. On older versions you can do this instead:

s = df["a"] - np.arange(len(df))
df["c"] = s.groupby(s).cumcount() + 1
print(df)

CodePudding user response：

A simple solution is to flag the rows that continue a consecutive run, take the cumulative sum of those flags to get a running count, and then subtract the count reached at the start of each run so that every run restarts at 1.

a = df['a'].add(1).shift(1).eq(df['a'])
df['c'] = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int) + 1
df

Result:

      a      b  c
0     1  27066  1
1     2  28155  2
2     3  49177  3
3     4    496  4
4     5   2354  5
5    15  23292  1
6    16   9358  2
7    17  19036  3
8    18  29946  4
9   203  39785  1
10  204  15843  2
11  205  21917  3
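
If the one-liner is hard to follow, the intermediate steps can be printed one at a time. This is just a sketch on column a alone; the names is_next, running and offset are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 15, 16, 17, 18, 203, 204, 205]})

# True where a row is exactly one more than the row above it
is_next = df["a"].add(1).shift(1).eq(df["a"])

# Running count of "continuation" rows over the whole column
running = is_next.cumsum()
print(running.tolist())
# [0, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9]

# Value of the running count at the start of each run, carried forward
offset = running.where(~is_next).ffill().fillna(0).astype(int)
print(offset.tolist())
# [0, 0, 0, 0, 0, 4, 4, 4, 4, 7, 7, 7]

# Removing the offset restarts the count in every run; + 1 makes it 1-based
c = running - offset + 1
print(c.tolist())
# [1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 2, 3]
```

Because this compares each row only with the row directly above it, it does not rely on column a being sorted.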