I have a dataframe where the values in a column are strings which are often mispelled, something like:
col01 | col02 | name
-----------------------
--- | --- | mrk
--- | --- | anth3n7
--- | --- | j4ck
and a list of possible correct values for this column
possible_names = ['mark', 'anthony', 'jack']
I wanted to go over the elements of the 'name' column, and get the closest match from the list of correct names, so that I would end up with:
col01 | col02 | name | correct_name
-----------------------------------------
--- | --- | mrk | mark
--- | --- | anth3n7 | anthony
--- | --- | j4ck | jack
I tried using this code:
df_names = pd.read_csv('names/wrong.csv')
possible_names = ['mark', 'anthony', 'jack']
df_names['correct_name'] = difflib.get_close_matches(df_names['name'], possible_names)
but I get the error:
ValueError: Length of values (0) does not match length of index (3)
CodePudding user response:
Use:
import difflib
import pandas as pd
df = pd.DataFrame(data=["mrk", "anth3n7", "j4ck"], columns=["name"])
possible_names = ['mark', 'anthony', 'jack']
df["correct_name"] = df["name"].apply(lambda x: difflib.get_close_matches(x, possible_names)[0])
print(df)
Output
name correct_name
0 mrk mark
1 anth3n7 anthony
2 j4ck jack
As an alternative instead of .apply
, you could use a list comprehension:
df["correct_name"] = [difflib.get_close_matches(name, possible_names)[0] for name in df["name"]]
UPDATE
For the case of no match, one approach is to use:
possible_names = ['anthony', 'jack']
df["correct_name"] = [next(iter(difflib.get_close_matches(name, possible_names)), name) for name in df["name"]]
print(df)
Output
name correct_name
0 mrk mrk
1 anth3n7 anthony
2 j4ck jack
CodePudding user response:
Let us try
from fuzzywuzzy import process
df['new']=df.Name.apply(lambda x : [process.extract(x, possible_names, limit=1)][0][0][0])