I have this data frame:
import pandas as pd
d = {'col1': [1, 2, 0, 55, 12, 1, 3, 1, 56, 13], 'col2': [3, 4, 44, 34, 46, 2, 3, 43, 35, 47], 'col3': ['A', 'A', 'A', 'B', 'B', 'A', 'B', 'B', 'B', 'B']}
df = pd.DataFrame(data=d)
df
col1 col2 col3
0 1 3 A
1 2 4 A
2 0 44 A
3 55 34 B
4 12 46 B
5 1 2 A
6 3 3 B
7 1 43 B
8 56 35 B
9 13 47 B
The goal is to end up with a data frame that looks like this:
df
col1 col2 col3 label
0 1 3 A NaN
1 2 4 A NaN
2 0 44 A 1
3 55 34 B NaN
4 12 46 B NaN
5 1 2 A NaN
6 3 3 B NaN
7 1 43 B NaN
8 56 35 B 1
9 13 47 B NaN
We get the column label by:
1- counting the length of each run of consecutive identical values in col3. I do this as follows:
s = df['col3'].ne(df['col3'].shift()).cumsum()
df['count'] = s.map(s.value_counts())
So I get this:
col1 col2 col3 count
0 1 3 A 3
1 2 4 A 3
2 0 44 A 3
3 55 34 B 2
4 12 46 B 2
5 1 2 A 1
6 3 3 B 4
7 1 43 B 4
8 56 35 B 4
9 13 47 B 4
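To see why the two lines above produce these counts, it can help to look at the intermediate values. This is only a minimal sketch on the same data; the names starts, run_id and run_len are illustrative and do not appear in the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 0, 55, 12, 1, 3, 1, 56, 13],
    'col2': [3, 4, 44, 34, 46, 2, 3, 43, 35, 47],
    'col3': ['A', 'A', 'A', 'B', 'B', 'A', 'B', 'B', 'B', 'B'],
})

# True wherever col3 differs from the previous row (the first row is always True)
starts = df['col3'].ne(df['col3'].shift())
# The cumulative sum of those flags gives every consecutive run its own id
run_id = starts.cumsum()
# Length of the run that each row belongs to
run_len = run_id.map(run_id.value_counts())

print(run_id.tolist())   # [1, 1, 1, 2, 2, 3, 4, 4, 4, 4]
print(run_len.tolist())  # [3, 3, 3, 2, 2, 1, 4, 4, 4, 4]
```

Here run_id plays the role of s, and run_len is what ends up in the count column.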
The target: I would like to create a new column, label, by iterating over the count column. Whenever its value is >= 3, the 3rd row of that sub-group should receive 1 (in our case the sub-groups are AAA, then BB, A and finally BBBB), so that we get this:
df
col1 col2 col3 label
0 1 3 A NaN
1 2 4 A NaN
2 0 44 A 1
3 55 34 B NaN
4 12 46 B NaN
5 1 2 A NaN
6 3 3 B NaN
7 1 43 B NaN
8 56 35 B 1
9 13 47 B NaN
CodePudding user response:
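I think you need cumcount. It numbers the rows within each run starting from 0, so the 3rd row of a run is the one where it equals 2, and such a row only exists when the run has at least 3 rows. Using s from your question: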
df.loc[s.groupby(s).cumcount()==2,'new']=1
df
Out[235]:
col1 col2 col3 new
0 1 3 A NaN
1 2 4 A NaN
2 0 44 A 1.0
3 55 34 B NaN
4 12 46 B NaN
5 1 2 A NaN
6 3 3 B NaN
7 1 43 B NaN
8 56 35 B 1.0
9 13 47 B NaN
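For completeness, here is a self-contained sketch of the same idea that writes to a column named label, as in the question. The variable names run_id and pos are illustrative assumptions; the approach is the cumcount one from the answer above, not a different method.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 0, 55, 12, 1, 3, 1, 56, 13],
    'col2': [3, 4, 44, 34, 46, 2, 3, 43, 35, 47],
    'col3': ['A', 'A', 'A', 'B', 'B', 'A', 'B', 'B', 'B', 'B'],
})

# Id of each run of consecutive identical col3 values (same as s in the question)
run_id = df['col3'].ne(df['col3'].shift()).cumsum()
# Zero-based position of each row inside its run
pos = df.groupby(run_id).cumcount()
# Position 2 is the 3rd row; it can only occur in runs of length >= 3,
# so no separate check on the count column is needed
df.loc[pos == 2, 'label'] = 1

print(df)
```

Rows that are not assigned stay NaN, which is why the new column is shown as 1.0 rather than 1: pandas stores it as float.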