I have the following dataframe of names and date of birth. The data type is all over the place, missing values and so on, like this one:
Employee type First Name Middle Name Last Name Date of Birth
Employee Paulo Cortez 01-01-90
Employee Paulo Ricardo Cortez 01-01-90
Employee Paulo Cortez 01-01-90
Employee Paulo Cortez 02-01-90
Employee Cortez
Employee Paulo Cortez $
Employee Maria ## Silva 02-01-90
Employee o,89 Pedro s
Employee Maria Silva
Employee Maria Silva 02-01-90
Employee Joao Augusto 02-01-90
Employee Maria Silva
I need a way to create an ID column, taking whatever value I have in the combination of names date of birth and create an unique ID per each of the rows like this one:
Employee type First Name Middle Name Last Name Date of Birth ID
Employee Paulo Cortez 01-01-90 10000
Employee Paulo Ricardo Cortez 01-01-90 10001
Employee Paulo Cortez 01-01-90 10000
Employee Paulo Cortez 02-01-90 10002
Employee Cortez 10003
Employee Paulo Cortez $ 10004
Employee Maria ## Silva 02-01-90 10005
Employee o,89 Pedro s 10006
Employee Maria Silva 10007
Employee Maria Silva 02-01-90 10008
Employee Joao Augusto 02-01-90 10009
Employee Maria Silva 10007
The ID should start in 10000 What would be the best way to create this column?
CodePudding user response:
Make sure, that you don't have duplicates and create an ID column:
df['ID'] = range(len(df))
CodePudding user response:
You can assign each group a unique id with pd.groupby.ngroup. One small issue with this method (Thanks to @milkwithfish for reminding) : It will assign -1 to each group if there is a null value in any of the groups. So, filling null values will be a safer method:
df.fillna(0, inplace=True)
df['id'] = df.groupby(['First Name', 'Last Name', 'Date of Birth']).ngroup() 10000