Create Unique ID by combining First Name, Middle Name and Date of Birth-CodePudding

I have the following dataframe of names and date of birth. The data type is all over the place, missing values and so on, like this one:

Employee type   First Name  Middle Name Last Name   Date of Birth
Employee    Paulo       Cortez  01-01-90
Employee    Paulo   Ricardo Cortez  01-01-90
Employee    Paulo       Cortez  01-01-90
Employee    Paulo       Cortez  02-01-90
Employee            Cortez  
Employee    Paulo       Cortez  $
Employee    Maria   ##  Silva   02-01-90
Employee    o,89    Pedro       s
Employee    Maria       Silva   
Employee    Maria       Silva   02-01-90
Employee    Joao        Augusto 02-01-90
Employee    Maria       Silva

I need a way to create an ID column, taking whatever value I have in the combination of names date of birth and create an unique ID per each of the rows like this one:

Employee type   First Name  Middle Name Last Name   Date of Birth   ID
Employee    Paulo       Cortez  01-01-90    10000
Employee    Paulo   Ricardo Cortez  01-01-90    10001
Employee    Paulo       Cortez  01-01-90    10000
Employee    Paulo       Cortez  02-01-90    10002
Employee            Cortez      10003
Employee    Paulo       Cortez  $   10004
Employee    Maria   ##  Silva   02-01-90    10005
Employee    o,89    Pedro       s   10006
Employee    Maria       Silva       10007
Employee    Maria       Silva   02-01-90    10008
Employee    Joao        Augusto 02-01-90    10009
Employee    Maria       Silva       10007

The ID should start in 10000 What would be the best way to create this column?

CodePudding user response：

Make sure, that you don't have duplicates and create an ID column:

df['ID'] = range(len(df))

CodePudding user response：

You can assign each group a unique id with pd.groupby.ngroup. One small issue with this method (Thanks to @milkwithfish for reminding) : It will assign -1 to each group if there is a null value in any of the groups. So, filling null values will be a safer method:

df.fillna(0, inplace=True)
df['id'] = df.groupby(['First Name', 'Last Name', 'Date of Birth']).ngroup()   10000