Home > Blockchain >  Incrementing frequency in data,for a specific number using a Python dataframe?
Incrementing frequency in data,for a specific number using a Python dataframe?

Time:11-04

I have a dataset like:

a   
c   
c   
d   
b   
a   
a   
d   
d   
c
c   
b   
a   
b

I want to add a column that looks like the one below. When 'c' is reached, the new column will be zero and then be increased by one. Is there a way we can do this using python?

a   1
c   0
c   0
d   2
b   2
a   2
a   2
d   2
d   2
c   0
c   0
b   3
a   3
b   3

CodePudding user response:

Use:

s = df['col'].eq('c')
df['new'] = s.ne(s.shift())[~s].cumsum().reindex(df.index, fill_value=0)

s = df['col'].ne('c')
df['new'] = (s & ~s.shift(fill_value=False)).cumsum().where(s, 0)

print (df)
   col  new
0    a    1
1    c    0
2    c    0
3    d    2
4    b    2
5    a    2
6    a    2
7    d    2
8    d    2
9    c    0
10   c    0
11   b    3
12   a    3
13   b    3

Explanation:

First filter c (mask), get first values of consecutive groups by compare shifted values (consecutive), filter non c values (filtered) and create groups cumulative sum by Series.cumsum (cumsum), last add 0 by Series.reindex (out):

print (df.assign(mask = df['col'].eq('c'),
                 consecutive=s.ne(s.shift()),
                 filtered=s.ne(s.shift())[~s],
                 cumsum=s.ne(s.shift())[~s].cumsum(),
                 out=s.ne(s.shift())[~s].cumsum().reindex(df.index, fill_value=0)))

   col  new   mask  consecutive filtered  cumsum  out
0    a    1  False         True     True     1.0    1
1    c    0   True         True      NaN     NaN    0
2    c    0   True        False      NaN     NaN    0
3    d    2  False         True     True     2.0    2
4    b    2  False        False    False     2.0    2
5    a    2  False        False    False     2.0    2
6    a    2  False        False    False     2.0    2
7    d    2  False        False    False     2.0    2
8    d    2  False        False    False     2.0    2
9    c    0   True         True      NaN     NaN    0
10   c    0   True        False      NaN     NaN    0
11   b    3  False         True     True     3.0    3
12   a    3  False        False    False     3.0    3
13   b    3  False        False    False     3.0    3

Another idea is filter non c(mask), then get first Trues by no c group (irst_true_per_groups), add cumulative sum (cumsum) and last replace c rows by 0 in Series.where (out):

print (df.assign(mask = df['col'].ne('c'),
                  first_true_per_groups=s & ~s.shift(fill_value=False),
                  cumsum=(s & ~s.shift(fill_value=False)).cumsum(),
                  out=(s & ~s.shift(fill_value=False)).cumsum().mask(s, 0)))

   col   mask  first_true_per_groups  cumsum  out
0    a   True                   True       1    0
1    c  False                  False       1    1
2    c  False                  False       1    1
3    d   True                   True       2    0
4    b   True                  False       2    0
5    a   True                  False       2    0
6    a   True                  False       2    0
7    d   True                  False       2    0
8    d   True                  False       2    0
9    c  False                  False       2    2
10   c  False                  False       2    2
11   b   True                   True       3    0
12   a   True                  False       3    0
13   b   True                  False       3    0

CodePudding user response:

    df1.assign(col2=df1.col1.ne('c').astype(int))\
        .assign(col3=lambda dd:(dd.col2.diff()==1).cumsum())\
        .assign(col4=lambda dd:dd.col2*(dd.col2 dd.col3))
    
    
       col1  col2  col3  col4
    0     a     1     0     1
    1     c     0     0     0
    2     c     0     0     0
    3     d     1     1     2
    4     b     1     1     2
    5     a     1     1     2
    6     a     1     1     2
    7     d     1     1     2
    8     d     1     1     2
    9     c     0     1     0
    10    c     0     1     0
    11    b     1     2     3
    12    a     1     2     3
    13    b     1     2     3


or 

df1=df1.assign(col2=(df1.eq('c')&df1.shift().ne('c')).cumsum() 1)
df1.loc[df1.col1=='c']=0
df1
  • Related