Python reading *.csv. How to create ID's based on a name column. If name is the same, then repe-CodePudding

I have this file in csv and I would like to create IDs by replacing the names with the IDs so they don't know who the people are. However I have this Name table below that as you can see has names that are repeated. I mean, where there are repeated names I substitute the same ID, and where there are different names I insert a new ID. I'd like to make it look like the IDs table I attached.

I think the idea is to take a record from the name column, read line by line comparing if there is a record equal to this one, if there is, repeat the id, if not, create a new id, adding these ids to a list and then adding this one list in a new column and removing the name column. I think this way, but I don't know how to do it How could I do this in Python? I have no idea

Table Name              
Year,Name,Points,Sales Qtd,Sales date
2018,Mike,10,50,2018-01-01
2018,Joan,5,25,2018-01-02
2019,Joan,30,55,2019-01-05
2019,Micheal,10,35, 2019-03-15
2019,Micheal,5,20,2019-03-18
2019,Micheal,8,15,2019-04-20
2019,Elizabeth,20,50,2019-05-21

Table Name

Tabe IDs                
Year,NameID,Points,Sales qtd,sales date
2018,1,10,50,2018-01-01
2018,2,5,25,2018-01-02
2019,2,30,55,2019-01-05
2019,3,10,35,2019-03-15
2019,3,5,20,2019-03-18
2019,3,8,15,2019-04-20
2019,4,20,50,2019-05-21

Table IDs

I tried this code below with pandas but, it is not working

import pandas as pd

list_row = []

results = pd.read_csv('test_name.csv')
rowcount = 0
for row in open("test_name.csv"):
    rowcount  = 1
print("Number of lines present: ", rowcount)

first_row = results["Name"].loc[0]
print(first_row)

count = 0
count_reg = 0
count_row = 0

results = results.reset_index()
print(results["Name"].loc[count])
while count < rowcount:    
    print(results["Name"].loc[count], results["Name"].loc[count_reg])    
    for index, row in results.iterrows():
        if results["Name"].loc[count] == results["Name"].loc[count_reg]:            
            count_reg  = 1
            list_row.append(count_row)
            count_row  = 1            
        else:            
            count_reg  = 1
            list_row.append(count_row)            
    count  = 1    
print("end")

CodePudding user response：

I would do it following way.

import itertools
import pandas as pd
df = pd.DataFrame({"name":["Mike","Joan","Joan","Micheal","Micheal","Micheal","Elizabeth"]})
replacements = dict(zip(df['name'].unique(),itertools.count(1)))
# replacements is {'Mike': 1, 'Joan': 2, 'Micheal': 3, 'Elizabeth': 4}
df["nameid"] = df["name"].replace(replacements)
print(df)

output:

        name  nameid
0       Mike       1
1       Joan       2
2       Joan       2
3    Micheal       3
4    Micheal       3
5    Micheal       3
6  Elizabeth       4

Explanation: I get unique names then create dict with values being subsequent number starting from 1, then I use .replace to get number corresponding to name. Note I limited data just for Name for brevity sake.

CodePudding user response：

If you do not mind to have the numbers in a different order than you suggest (e.g. alphabetical; be aware of the implications of this assumption), you could do as such:

df['Name'] = df['Name'].rank(method='dense').astype(int)
print(df)

Output:

   Year  Name  Points  Sales Qtd   Sales date
0  2018     4      10         50   2018-01-01
1  2018     2       5         25   2018-01-02
2  2019     2      30         55   2019-01-05
3  2019     3      10         35   2019-03-15
4  2019     3       5         20   2019-03-18
5  2019     3       8         15   2019-04-20
6  2019     1      20         50   2019-05-21

CodePudding user response：

You need to change the mindset from "processing line-by-line" to working with set-based data.

Try this:

# Find the unique names and assign them an ID
ids = (
    results['Name'].drop_duplicates()
        .to_frame()
        .assign(NameID=lambda x: range(len(x)))
)

# Replace the original names with IDs
results = results.merge(ids, how="inner", on="Name").drop(columns='Name')