Home > Software engineering >  Replacing names with IDs in two datasets
Replacing names with IDs in two datasets

Time:03-02

I have two datasets: one includes customer's family details and another one includes classes associated with those customers. I would like to replace the name of customers with IDs for privacy reasons. An example of data is

dataset 1 (customer's family relationships)

Customer       Relative        Age   Note
Amber Bryan   Viola Walter      22    none
Amber Bryan   Christopher Lyl   22    none
Viola Walter  Stephan Said      43    xxx
Sion X.       Martin Grey              64    none

dataset 2 (classes)

Customer    Class      Age 
Amber Bryan 1          22
Viola Walter 2         43
Christopher Lyl  -2    41
Stephan Said     1     42
Sion X.          0     64
Martin Grey     1      34

I would like to get the following datasets:

Customer       Relative  Age   Note
1                 2      22    none
1                 3      22    none
2                 4      43    xxx
5                 6      64    none

and

Customer    Class      Age 
    1         1         22
    2         2         43
    3        -2         41
    4         1         42
    5         0         64
    6         1         34

It would be good if the number of IDs would be given based on the list provided in the dataset 2.

I am thinking of creating an index column for the dataset 2 but I do not know how to use this information in dataset 1, also considering that I would need to assign IDs for both Customer and Relative.

CodePudding user response:

This is the input you have:

import pandas as pd
df1 = pd.DataFrame.from_dict({'Customer': {0: 'Amber Bryan',
  1: 'Amber Bryan',
  2: 'Viola Walter',
  3: 'Sion X.'},
 'Relative': {0: 'Viola Walter',
  1: 'Christopher Lyl',
  2: 'Stephan Said',
  3: 'Martin Grey'},
 'Age': {0: '22', 1: '22', 2: '43', 3: '64'},
 'Note': {0: 'none', 1: 'none', 2: 'xxx', 3: 'none'}})

df2 = pd.DataFrame.from_dict({'Customer': {0: 'Amber Bryan 1',
  1: 'Viola Walter 2',
  2: 'Christopher Lyl',
  3: 'Stephan Said',
  4: 'Sion X.',
  5: 'Martin Grey'},
 'Class': {0: '22', 1: '43', 2: '-2', 3: '1', 4: '0', 5: '1'},
 'Age ': {0: None, 1: None, 2: '41', 3: '42', 4: '64', 5: '34'}})

You can create a set contains all unique names and assign an id to that. You can enumerate on that or you can also assign random number or hash or etc. to each name and save that dictionary for future references.

name2id = {name:i 1 for i, name in enumerate(set(df1.Customer.to_list()   df2.Customer.to_list()))}
df1['Customer'] = df1['Customer'].map(name2id)
df1['Relative'] = df1['Relative'].map(name2id)
df2['Customer'] = df2['Customer'].map(name2id)

Output:

# df1:
   Customer  Relative Age  Note
0         1         4  22  none
1         1         2  22  none
2         4         7  43   xxx
3         5         3  64  none

# df2:
   Customer Class  Age 
0         6    22  None
1         8    43  None
2         2    -2    41
3         7     1    42
4         5     0    64
5         3     1    34

CodePudding user response:

I think this should work, you can create a dictionary to asign unique identification numbers for each customers and then, in whatever dataframe you need, you just pull that customers id.

customersid = dict(zip(range(1, df['Customer'].unique().count() 1), df['Customer'].unique()))
      
df['Customer ID'] = customersid.get(df['Customer'])
  • Related