I have the following DataFrame:
df = pd.DataFrame({'Category': {0: 'onboarding segment-confirmation-unexpected-input origin',
1: 'onboarding segment-confirmation-unexpected-input view',
2: 'product-availability cpf-request-unexpected-input origin',
3: 'product-availability postalcode-validation-true-unexpected-input origin',
4: 'product-availability postalcode-validation-true-unexpected-input view'},
'UserId': {0: 9090, 1: 4545, 2: 3266, 3: 2894, 4: 2772}})
What I want to do is create a flag based on the part of the string other than the trailing word "view" or "origin". If that part is equal to the previous row's value, keep the current flag; if not, increase the flag value.
Desired result:
df = pd.DataFrame({'Category': {0: 'onboarding segment-confirmation-unexpected-input origin',
1: 'onboarding segment-confirmation-unexpected-input view',
2: 'product-availability cpf-request-unexpected-input origin',
3: 'product-availability postalcode-validation-true-unexpected-input origin',
4: 'product-availability postalcode-validation-true-unexpected-input view'},
'UserId': {0: 9090, 1: 4545, 2: 3266, 3: 2894, 4: 2772},
'Flag':{0:'Flag_1',1:'Flag_1',2:'Flag_2',3:'Flag_3',4:'Flag_3'}})
What would be the way to do this? I tried slicing the string and using a groupby, but I am having a little difficulty with the incrementing part.
CodePudding user response:
Assuming you want to consider the first 2 blocks of the string (blocks being separated by spaces):
# get substrings, keep first 2 (can be changed)
df2 = df['Category'].str.split(expand=True).iloc[:, :2]
# start new group if any value is different from the previous row
group = df2.ne(df2.shift()).any(axis=1).cumsum()
# add flag
df['Flag'] = 'Flag_' + group.astype(str)
output:
Category UserId Flag
0 onboarding segment-confirmation-unexpected-inp... 9090 Flag_1
1 onboarding segment-confirmation-unexpected-inp... 4545 Flag_1
2 product-availability cpf-request-unexpected-in... 3266 Flag_2
3 product-availability postalcode-validation-tru... 2894 Flag_3
4 product-availability postalcode-validation-tru... 2772 Flag_3
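If the number of space-separated blocks is not always three, a variation is to strip the trailing word instead of keeping the first two blocks. The following is only a sketch: it assumes that "view" and "origin" are the only possible suffixes and that they always come last, and the name `base` is just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': [
        'onboarding segment-confirmation-unexpected-input origin',
        'onboarding segment-confirmation-unexpected-input view',
        'product-availability cpf-request-unexpected-input origin',
        'product-availability postalcode-validation-true-unexpected-input origin',
        'product-availability postalcode-validation-true-unexpected-input view',
    ],
    'UserId': [9090, 4545, 3266, 2894, 2772],
})

# remove a trailing " view" or " origin" (assumed to be the only suffixes)
base = df['Category'].str.replace(r'\s+(?:view|origin)$', '', regex=True)
# start a new group whenever the remaining text differs from the previous row
group = base.ne(base.shift()).cumsum()
# add flag
df['Flag'] = 'Flag_' + group.astype(str)
```

As with the code above, the comparison is against the previous row only, so a value that reappears after a different one starts a new flag.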
CodePudding user response:
This works for me:
import pandas as pd
df = pd.DataFrame({'Category': {0: 'onboarding segment-confirmation-unexpected-input origin',
1: 'onboarding segment-confirmation-unexpected-input view',
2: 'product-availability cpf-request-unexpected-input origin',
3: 'product-availability postalcode-validation-true-unexpected-input origin',
4: 'product-availability postalcode-validation-true-unexpected-input view'},
'UserId': {0: 9090, 1: 4545, 2: 3266, 3: 2894, 4: 2772}})
#I chose 40 but you can change it to fit your needs depending on the data
df['temp']=df['Category'].str[:40]
df['Flag'] = df.groupby(['temp'], sort=False).ngroup() + 1
df['Flag'] = 'Flag_' + df['Flag'].astype(str)
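A possible way to avoid picking a fixed length such as 40 is to group on everything except the last space-separated word. This is a hedged sketch rather than a drop-in replacement: it assumes the word to ignore is always the last one, and `key` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': [
        'onboarding segment-confirmation-unexpected-input origin',
        'onboarding segment-confirmation-unexpected-input view',
        'product-availability cpf-request-unexpected-input origin',
        'product-availability postalcode-validation-true-unexpected-input origin',
        'product-availability postalcode-validation-true-unexpected-input view',
    ],
    'UserId': [9090, 4545, 3266, 2894, 2772],
})

# split once from the right on whitespace and keep the left-hand part
key = df['Category'].str.rsplit(n=1).str[0]
# number the groups in order of first appearance, starting at 1
df['Flag'] = 'Flag_' + (df.groupby(key, sort=False).ngroup() + 1).astype(str)
```

Note that groupby numbers each distinct value once, so a value that shows up again further down gets its earlier flag back instead of a new one. Whether that is wanted depends on the data.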