I have a dataframe with the following 4 columns: the name, a column that identifies the primary contract, the employee type, and the ID number. Like this one:
Name           Primary row?  Employee Type  ID
Paulo Cortez   Yes           Employee       100000
Paulo Cortez   No            Employee       100000
Joan San       Yes           Non-employee   100001
Felipe Castro  Yes           Contractor     100002
Felipe Castro  No            Employee       100002
Felipe Castro  No            Contractor     100002
I need to create a sub ID column that takes the ID value and adds the first letter of the employee type in front (the type may be Employee, Non-employee, or Contractor). If the ID appears more than once, it needs to check the "Primary row?" column. If it says "Yes", the format stays the same; for the others that have "No", add a tag of "-2", "-3", etc., as follows:
Name           Primary row?  Employee Type  ID      sub ID
Paulo Cortez   Yes           Employee       100000  E100000
Paulo Cortez   No            Employee       100000  E100000-2
Joan San       Yes           Non-employee   100001  N100001
Felipe Castro  Yes           Contractor     100002  C100002
Felipe Castro  No            Employee       100002  E100002-2
Felipe Castro  No            Contractor     100002  E100002-3
What would be the best way to achieve this result?
CodePudding user response:
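Here is one way to do it. First use groupby with cumcount to number the rows within each ID, which gives the suffix if one is needed. Then apply a function to each row that concatenates all the parts. This relies on the "Yes" row coming first within each ID, as it does in the sample data.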
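# Number the rows within each ID group, starting at 1
df['sub_ID'] = df.groupby('ID').cumcount().add(1)
# Concatenate: first letter of the type + ID + optional "-n" suffix
df['sub_ID'] = df.apply(lambda row:
    row['Employee Type'][0]
    + str(row['ID'])
    + ("" if row['Primary row?']=="Yes" else "-" + str(row['sub_ID']))
    ,axis=1)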
Output df:
            Name Primary row? Employee Type      ID     sub_ID
0   Paulo Cortez          Yes      Employee  100000    E100000
1   Paulo Cortez           No      Employee  100000  E100000-2
2       Joan San          Yes  Non-employee  100001    N100001
3  Felipe Castro          Yes    Contractor  100002    C100002
4  Felipe Castro           No      Employee  100002  E100002-2
5  Felipe Castro           No    Contractor  100002  C100002-3
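As an alternative, the same result can probably be reached without a row-wise apply by building the parts as whole columns. The following is only a sketch under assumptions: the sample dataframe is rebuilt by hand from the question, the intermediate names position, base and suffix are made up for illustration, and the "Yes" row is assumed to come first within each ID.

```python
import pandas as pd

# Sample data rebuilt by hand from the question.
df = pd.DataFrame({
    "Name": ["Paulo Cortez", "Paulo Cortez", "Joan San",
             "Felipe Castro", "Felipe Castro", "Felipe Castro"],
    "Primary row?": ["Yes", "No", "Yes", "Yes", "No", "No"],
    "Employee Type": ["Employee", "Employee", "Non-employee",
                      "Contractor", "Employee", "Contractor"],
    "ID": [100000, 100000, 100001, 100002, 100002, 100002],
})

# 1-based position of each row within its ID group (1, 2, 3, ...).
position = df.groupby("ID").cumcount().add(1)

# First letter of the employee type followed by the ID as text.
base = df["Employee Type"].str[0] + df["ID"].astype(str)

# Keep "-2", "-3", ... on non-primary rows; use "" on primary rows.
suffix = ("-" + position.astype(str)).where(df["Primary row?"] != "Yes", "")

df["sub_ID"] = base + suffix
print(df)
```

If the "Yes" row is not guaranteed to come first within each ID, the rows would need to be sorted that way before cumcount is applied, otherwise the suffix numbers will not start at 2.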