I have a dataset like:
a
c
c
d
b
a
a
d
d
c
c
b
a
b
I want to add a column that looks like the one below. When 'c' is reached, the new column will be zero and then be increased by one. Is there a way we can do this using python?
a 1
c 0
c 0
d 2
b 2
a 2
a 2
d 2
d 2
c 0
c 0
b 3
a 3
b 3
CodePudding user response:
Use:
s = df['col'].eq('c')
df['new'] = s.ne(s.shift())[~s].cumsum().reindex(df.index, fill_value=0)
s = df['col'].ne('c')
df['new'] = (s & ~s.shift(fill_value=False)).cumsum().where(s, 0)
print (df)
col new
0 a 1
1 c 0
2 c 0
3 d 2
4 b 2
5 a 2
6 a 2
7 d 2
8 d 2
9 c 0
10 c 0
11 b 3
12 a 3
13 b 3
Explanation:
First filter c
(mask), get first values of consecutive groups by compare shifted values (consecutive), filter non c
values (filtered) and create groups cumulative sum by Series.cumsum
(cumsum), last add 0 by Series.reindex
(out):
print (df.assign(mask = df['col'].eq('c'),
consecutive=s.ne(s.shift()),
filtered=s.ne(s.shift())[~s],
cumsum=s.ne(s.shift())[~s].cumsum(),
out=s.ne(s.shift())[~s].cumsum().reindex(df.index, fill_value=0)))
col new mask consecutive filtered cumsum out
0 a 1 False True True 1.0 1
1 c 0 True True NaN NaN 0
2 c 0 True False NaN NaN 0
3 d 2 False True True 2.0 2
4 b 2 False False False 2.0 2
5 a 2 False False False 2.0 2
6 a 2 False False False 2.0 2
7 d 2 False False False 2.0 2
8 d 2 False False False 2.0 2
9 c 0 True True NaN NaN 0
10 c 0 True False NaN NaN 0
11 b 3 False True True 3.0 3
12 a 3 False False False 3.0 3
13 b 3 False False False 3.0 3
Another idea is filter non c
(mask), then get first Trues by no c
group (irst_true_per_groups), add cumulative sum (cumsum) and last replace c
rows by 0
in Series.where
(out):
print (df.assign(mask = df['col'].ne('c'),
first_true_per_groups=s & ~s.shift(fill_value=False),
cumsum=(s & ~s.shift(fill_value=False)).cumsum(),
out=(s & ~s.shift(fill_value=False)).cumsum().mask(s, 0)))
col mask first_true_per_groups cumsum out
0 a True True 1 0
1 c False False 1 1
2 c False False 1 1
3 d True True 2 0
4 b True False 2 0
5 a True False 2 0
6 a True False 2 0
7 d True False 2 0
8 d True False 2 0
9 c False False 2 2
10 c False False 2 2
11 b True True 3 0
12 a True False 3 0
13 b True False 3 0
CodePudding user response:
df1.assign(col2=df1.col1.ne('c').astype(int))\
.assign(col3=lambda dd:(dd.col2.diff()==1).cumsum())\
.assign(col4=lambda dd:dd.col2*(dd.col2 dd.col3))
col1 col2 col3 col4
0 a 1 0 1
1 c 0 0 0
2 c 0 0 0
3 d 1 1 2
4 b 1 1 2
5 a 1 1 2
6 a 1 1 2
7 d 1 1 2
8 d 1 1 2
9 c 0 1 0
10 c 0 1 0
11 b 1 2 3
12 a 1 2 3
13 b 1 2 3
or
df1=df1.assign(col2=(df1.eq('c')&df1.shift().ne('c')).cumsum() 1)
df1.loc[df1.col1=='c']=0
df1