I have a dataframe with employee names, each row representing their contract information, the type of employee of their contract, the contract ID and which contract is the primary one (if the person has 1 contract, it is always marked as "Yes", if they have more than 1 contract, 1 is marked as "Yes" and others as "No") like in this example:
Name Employee Type ID Primary Contract?
Paulo Employee 10 Yes
Paulo Contractor 10 No
Paulo Contractor 10 No
Paulo Employee 10 No
Paulo Employee 10 No
Pedro Employee 11 No
Pedro Contractor 11 Yes
Jonas Contractor 16 Yes
Jonas Contractor 16 No
Felipe Non Worker 15 Yes
Maria Employee 13 No
Maria Contractor 13 Yes
Maria Employee 13 No
I need to create a column named "Assigned ID" that follows this logic:
The assignment ID for the row marked with "Yes" in "Primary Contract?" will always be the first digit letter of the employee type the ID number
The assignment ID for the other rows would follow the same logic but, if that ID already exists, add a "-2" right after it (not starting from the "-1" but "-2") and subsequential increase (-3,-4,etc) for the duplicates (no specific order for the ones with "No" in "Primary Contract?").
If the first digit letter of the ID changes, it is not considered the same ID.
The output expected then would be:
Name Employee Type ID Primary Contract? Assignment ID
Paulo Employee 10 Yes E10
Paulo Contractor 10 No C10
Paulo Contractor 10 No C10-2
Paulo Employee 10 No E10-2
Paulo Employee 10 No E10-3
Pedro Employee 11 No E11
Pedro Contractor 11 Yes C11
Jonas Contractor 16 Yes C16
Jonas Contractor 16 No C16-2
Felipe Non Worker 15 Yes N15
Maria Employee 13 No E13
Maria Contractor 13 Yes C13
Maria Employee 13 No E13-2
Thank you so much for the support!
CodePudding user response:
Let's try with groupby().cumcount()
to enumerate the rows within each ID/Type
df['Assignment_ID'] = df['Employee Type'].str[0].add(df['ID'].astype(str))
# also can groupby(['Employee Type', 'ID'])
counts df.groupby('Assignment_ID').cumcount().add(1)
df['Assignment_ID'] = np.where(counts==1, '', '-' counts.astype(str))
print(df)
Output:
Name Employee Type ID Primary Contract? Assignment_ID
0 Paulo Employee 10 Yes E10
1 Paulo Contractor 10 No C10
2 Paulo Contractor 10 No C10-2
3 Paulo Employee 10 No E10-2
4 Paulo Employee 10 No E10-3
5 Pedro Employee 11 No E11
6 Pedro Contractor 11 Yes C11
7 Jonas Contractor 16 Yes C16
8 Jonas Contractor 16 No C16-2
9 Felipe Non Worker 15 Yes N15
10 Maria Employee 13 No E13
11 Maria Contractor 13 Yes C13
12 Maria Employee 13 No E13-2