I have two datasets: one includes customer's family details and another one includes classes associated with those customers. I would like to replace the name of customers with IDs for privacy reasons. An example of data is
dataset 1 (customer's family relationships)
Customer Relative Age Note
Amber Bryan Viola Walter 22 none
Amber Bryan Christopher Lyl 22 none
Viola Walter Stephan Said 43 xxx
Sion X. Martin Grey 64 none
dataset 2 (classes)
Customer Class Age
Amber Bryan 1 22
Viola Walter 2 43
Christopher Lyl -2 41
Stephan Said 1 42
Sion X. 0 64
Martin Grey 1 34
I would like to get the following datasets:
Customer Relative Age Note
1 2 22 none
1 3 22 none
2 4 43 xxx
5 6 64 none
and
Customer Class Age
1 1 22
2 2 43
3 -2 41
4 1 42
5 0 64
6 1 34
It would be good if the number of IDs would be given based on the list provided in the dataset 2.
I am thinking of creating an index column for the dataset 2 but I do not know how to use this information in dataset 1, also considering that I would need to assign IDs for both Customer and Relative.
CodePudding user response:
This is the input you have:
import pandas as pd
df1 = pd.DataFrame.from_dict({'Customer': {0: 'Amber Bryan',
1: 'Amber Bryan',
2: 'Viola Walter',
3: 'Sion X.'},
'Relative': {0: 'Viola Walter',
1: 'Christopher Lyl',
2: 'Stephan Said',
3: 'Martin Grey'},
'Age': {0: '22', 1: '22', 2: '43', 3: '64'},
'Note': {0: 'none', 1: 'none', 2: 'xxx', 3: 'none'}})
df2 = pd.DataFrame.from_dict({'Customer': {0: 'Amber Bryan 1',
1: 'Viola Walter 2',
2: 'Christopher Lyl',
3: 'Stephan Said',
4: 'Sion X.',
5: 'Martin Grey'},
'Class': {0: '22', 1: '43', 2: '-2', 3: '1', 4: '0', 5: '1'},
'Age ': {0: None, 1: None, 2: '41', 3: '42', 4: '64', 5: '34'}})
You can create a set contains all unique names and assign an id to that. You can enumerate on that or you can also assign random number or hash or etc. to each name and save that dictionary for future references.
name2id = {name:i 1 for i, name in enumerate(set(df1.Customer.to_list() df2.Customer.to_list()))}
df1['Customer'] = df1['Customer'].map(name2id)
df1['Relative'] = df1['Relative'].map(name2id)
df2['Customer'] = df2['Customer'].map(name2id)
Output:
# df1:
Customer Relative Age Note
0 1 4 22 none
1 1 2 22 none
2 4 7 43 xxx
3 5 3 64 none
# df2:
Customer Class Age
0 6 22 None
1 8 43 None
2 2 -2 41
3 7 1 42
4 5 0 64
5 3 1 34
CodePudding user response:
I think this should work, you can create a dictionary to asign unique identification numbers for each customers and then, in whatever dataframe you need, you just pull that customers id.
customersid = dict(zip(range(1, df['Customer'].unique().count() 1), df['Customer'].unique()))
df['Customer ID'] = customersid.get(df['Customer'])