Home > Blockchain >  Create Unique ID by combining First Name, Middle Name and Date of Birth
Create Unique ID by combining First Name, Middle Name and Date of Birth

Time:08-22

I have the following dataframe of names and date of birth. The data type is all over the place, missing values and so on, like this one:

Employee type   First Name  Middle Name Last Name   Date of Birth
Employee    Paulo       Cortez  01-01-90
Employee    Paulo   Ricardo Cortez  01-01-90
Employee    Paulo       Cortez  01-01-90
Employee    Paulo       Cortez  02-01-90
Employee            Cortez  
Employee    Paulo       Cortez  $
Employee    Maria   ##  Silva   02-01-90
Employee    o,89    Pedro       s
Employee    Maria       Silva   
Employee    Maria       Silva   02-01-90
Employee    Joao        Augusto 02-01-90
Employee    Maria       Silva   

I need a way to create an ID column, taking whatever value I have in the combination of names date of birth and create an unique ID per each of the rows like this one:

Employee type   First Name  Middle Name Last Name   Date of Birth   ID
Employee    Paulo       Cortez  01-01-90    10000
Employee    Paulo   Ricardo Cortez  01-01-90    10001
Employee    Paulo       Cortez  01-01-90    10000
Employee    Paulo       Cortez  02-01-90    10002
Employee            Cortez      10003
Employee    Paulo       Cortez  $   10004
Employee    Maria   ##  Silva   02-01-90    10005
Employee    o,89    Pedro       s   10006
Employee    Maria       Silva       10007
Employee    Maria       Silva   02-01-90    10008
Employee    Joao        Augusto 02-01-90    10009
Employee    Maria       Silva       10007

The ID should start in 10000 What would be the best way to create this column?

CodePudding user response:

Make sure, that you don't have duplicates and create an ID column:

df['ID'] = range(len(df))

CodePudding user response:

You can assign each group a unique id with pd.groupby.ngroup. One small issue with this method (Thanks to @milkwithfish for reminding) : It will assign -1 to each group if there is a null value in any of the groups. So, filling null values will be a safer method:

df.fillna(0, inplace=True)
df['id'] = df.groupby(['First Name', 'Last Name', 'Date of Birth']).ngroup()   10000
  • Related