I have this file in csv and I would like to create IDs by replacing the names with the IDs so they don't know who the people are. However I have this Name table below that as you can see has names that are repeated. I mean, where there are repeated names I substitute the same ID, and where there are different names I insert a new ID. I'd like to make it look like the IDs table I attached.
I think the idea is to take a record from the name column, read line by line comparing if there is a record equal to this one, if there is, repeat the id, if not, create a new id, adding these ids to a list and then adding this one list in a new column and removing the name column. I think this way, but I don't know how to do it How could I do this in Python? I have no idea
Table Name
Year,Name,Points,Sales Qtd,Sales date
2018,Mike,10,50,2018-01-01
2018,Joan,5,25,2018-01-02
2019,Joan,30,55,2019-01-05
2019,Micheal,10,35, 2019-03-15
2019,Micheal,5,20,2019-03-18
2019,Micheal,8,15,2019-04-20
2019,Elizabeth,20,50,2019-05-21
Tabe IDs
Year,NameID,Points,Sales qtd,sales date
2018,1,10,50,2018-01-01
2018,2,5,25,2018-01-02
2019,2,30,55,2019-01-05
2019,3,10,35,2019-03-15
2019,3,5,20,2019-03-18
2019,3,8,15,2019-04-20
2019,4,20,50,2019-05-21
I tried this code below with pandas but, it is not working
import pandas as pd
list_row = []
results = pd.read_csv('test_name.csv')
rowcount = 0
for row in open("test_name.csv"):
rowcount = 1
print("Number of lines present: ", rowcount)
first_row = results["Name"].loc[0]
print(first_row)
count = 0
count_reg = 0
count_row = 0
results = results.reset_index()
print(results["Name"].loc[count])
while count < rowcount:
print(results["Name"].loc[count], results["Name"].loc[count_reg])
for index, row in results.iterrows():
if results["Name"].loc[count] == results["Name"].loc[count_reg]:
count_reg = 1
list_row.append(count_row)
count_row = 1
else:
count_reg = 1
list_row.append(count_row)
count = 1
print("end")
CodePudding user response:
I would do it following way.
import itertools
import pandas as pd
df = pd.DataFrame({"name":["Mike","Joan","Joan","Micheal","Micheal","Micheal","Elizabeth"]})
replacements = dict(zip(df['name'].unique(),itertools.count(1)))
# replacements is {'Mike': 1, 'Joan': 2, 'Micheal': 3, 'Elizabeth': 4}
df["nameid"] = df["name"].replace(replacements)
print(df)
output:
name nameid
0 Mike 1
1 Joan 2
2 Joan 2
3 Micheal 3
4 Micheal 3
5 Micheal 3
6 Elizabeth 4
Explanation: I get unique
names then create dict
with values being subsequent number starting from 1
, then I use .replace
to get number corresponding to name. Note I limited data just for Name
for brevity sake.
CodePudding user response:
If you do not mind to have the numbers in a different order than you suggest (e.g. alphabetical; be aware of the implications of this assumption), you could do as such:
df['Name'] = df['Name'].rank(method='dense').astype(int)
print(df)
Output:
Year Name Points Sales Qtd Sales date
0 2018 4 10 50 2018-01-01
1 2018 2 5 25 2018-01-02
2 2019 2 30 55 2019-01-05
3 2019 3 10 35 2019-03-15
4 2019 3 5 20 2019-03-18
5 2019 3 8 15 2019-04-20
6 2019 1 20 50 2019-05-21
CodePudding user response:
You need to change the mindset from "processing line-by-line" to working with set-based data.
Try this:
# Find the unique names and assign them an ID
ids = (
results['Name'].drop_duplicates()
.to_frame()
.assign(NameID=lambda x: range(len(x)))
)
# Replace the original names with IDs
results = results.merge(ids, how="inner", on="Name").drop(columns='Name')