I have two dataframes where there is only one key column, and it is people's full names. There are inconsistencies between the two columns and their spellings. For example, one name may have a letter missing, a name prefix like Mr. (where the other df does not), extra spaces, etc. I have double-checked that in both dataframes these columns are both object types/strings. I want to merge these two dataframes.
The code
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df1["BestMatch"] = df1["FULL_NAME"].map(lambda x: process.extractOne(x,df2["FULL_NAME"])[0])
gives me the error
TypeError: expected string or bytes-like object
I have also tried
#import difflib
#difflib.get_close_matches
df1['FULL_NAME'] = df1['FULL_NAME'].apply(lambda x: difflib.get_close_matches(x,df2['FULL_NAME'])[0])
gives me the error
IndexError: list index out of range
I've searched for how to fix these errors/the code, and nothing seems to quite give me the right answer. I am relatively inexperienced, and I'm guessing there is something I'm missing, but I'm not sure what it would be.
CodePudding user response:
You have NaN in the column FULL_NAME
of df1
.
Reproducible error:
df1 = pd.DataFrame({'FULL_NAME': ['Louis', np.nan, 'Alexandre']})
df2 = pd.DataFrame({'FULL_NAME': ['Mr Louis', 'Mr Paul', 'Mr Alexandre']})
>>> df1["FULL_NAME"].map(lambda x: process.extractOne(x, df2["FULL_NAME"])[0])
...
TypeError: expected string or bytes-like object
How to avoid this situation: dropna
before apply/map
.
df1['BestMatch'] = df1["FULL_NAME"].dropna().map(lambda x: process.extractOne(x,df2["FULL_NAME"])[0])
>>> df1
FULL_NAME BestMatch
0 Louis Mr Louis
1 NaN NaN
2 Alexandre Mr Alexandre
I wrote a similar answer on how to merge to dataframe using fuzzywuzzy
: https://stackoverflow.com/a/68945895/15239951
Debug Try to debug manually:
for x in df1['FULL_NAME']:
try:
b = process.extractOne(x, df2["FULL_NAME"])[0]
print(f"{x} <-> {b}")
except TypeError:
print(f"XXX Problem with '{x}'")
Output:
Louis <-> Mr Louis
XXX Problem with 'nan'
Alexandre <-> Mr Alexandre
Merge Use index [2]
instead of best match name [0]
:
df1 = pd.DataFrame({'FULL_NAME': ['Louis', np.nan, 'Alexandre'],
'DATA1': [10, 20, 30]})
df2 = pd.DataFrame({'FULL_NAME': ['Mr Louis', 'Mr Paul', 'Mr Alexandre'],
'DATA2': [11, 21, 31]})
best_match_index = lambda x: process.extractOne(x,df2["FULL_NAME"])[2]
df1['BestMatch'] = df1["FULL_NAME"].dropna().map(best_match_index)
out = df2.merge(df1, left_index=True, right_on='BestMatch', how='left')
Output
>>> out
FULL_NAME_x DATA2 FULL_NAME_y DATA1 BestMatch
0.0 Mr Louis 11 Louis 10.0 0
NaN Mr Paul 21 NaN NaN 1
2.0 Mr Alexandre 31 Alexandre 30.0 2