I'm currently cleaning a dataset that shows names and gifts received by each person.
Each row goes like this:
Name | Gift |
---|---|
Agustin Dellagiovanna | Chocolate |
Agustín Delalgiovanna | Furniture |
Agustín Dellagiovanna | Art |
As you can see in this example, these three rows represent the same person. But two of them have different typos. The same thing happens with a lot of names in the dataset.
I wanted to know if there is a way for me to find these variations of the same name and replace them with the correct spelling of the name.
For now my only idea is to find each variation after checking the list of unique values in that column, but this's very time consuming given that the dataset has 45954 rows.
Any ideas?
Thanks in advance!
CodePudding user response:
One way to find the distance between strings is using difflib.SequenceMatcher
, like in this snippet:
from difflib import SequenceMatcher
unique_names = df['Name'].unique()
threshold = 0.75
candidate_similar_names = []
for i, name1 in enumerate(unique_names[:-1]):
for name2 in unique_names[i 1:]:
similarity = SequenceMatcher(None, name1, name2).ratio()
if similarity > threshold:
candidate_similar_names.append((name1, name2, similarity))
candidate_similar_names
The lower the threshold parameter, the more distant the names will be.
CodePudding user response:
You can use the package fuzzywuzzy, as shown here by Alperen Cetin:
import pandas as pd
from fuzzywuzzy import fuzz
df = pd.DataFrame({'Name': ['Agustin Dellagiovanna', 'Agustín Delalgiovanna', 'Agustín Dellagiovanna'],
'Gift': ['Chocolate', 'Furniture', 'Art']})
def func(input_list):
for i in range(len(input_list)):
for j in range(len(input_list)):
if i < j and fuzz.ratio(input_list[i], input_list[j]) >= 90:
input_list[i] = input_list[j]
return input_list
wanted = ['Agustin Dellagiovan']
df['Name'] = func(df['Name'].to_list() wanted)[: -len(wanted)]
This will rename all the similar items to the same. If you create your list of wanted names (all the names that are correct. I don't know how you are going to decide this), then all names should change to those (given they are similar enough).