Join/merge/concat 2 dataframes in pandas where the key columns are inconsistently spelled-CodePudding

I have two dataframes where there is only one key column, and it is people's full names. There are inconsistencies between the two columns and their spellings. For example, one name may have a letter missing, a name prefix like Mr. (where the other df does not), extra spaces, etc. I have double-checked that in both dataframes these columns are both object types/strings. I want to merge these two dataframes.

The code

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

df1["BestMatch"] = df1["FULL_NAME"].map(lambda x: process.extractOne(x,df2["FULL_NAME"])[0])

gives me the error

TypeError: expected string or bytes-like object

I have also tried

#import difflib 

#difflib.get_close_matches

df1['FULL_NAME'] = df1['FULL_NAME'].apply(lambda x: difflib.get_close_matches(x,df2['FULL_NAME'])[0])

gives me the error

IndexError: list index out of range

I've searched for how to fix these errors/the code, and nothing seems to quite give me the right answer. I am relatively inexperienced, and I'm guessing there is something I'm missing, but I'm not sure what it would be.

CodePudding user response：

You have NaN in the column FULL_NAME of df1.

Reproducible error:

df1 = pd.DataFrame({'FULL_NAME': ['Louis', np.nan, 'Alexandre']})
df2 = pd.DataFrame({'FULL_NAME': ['Mr Louis', 'Mr Paul', 'Mr Alexandre']})

>>> df1["FULL_NAME"].map(lambda x: process.extractOne(x, df2["FULL_NAME"])[0])
...
TypeError: expected string or bytes-like object

How to avoid this situation: dropna before apply/map.

df1['BestMatch'] = df1["FULL_NAME"].dropna().map(lambda x: process.extractOne(x,df2["FULL_NAME"])[0])

>>> df1
   FULL_NAME     BestMatch
0      Louis      Mr Louis
1        NaN           NaN
2  Alexandre  Mr Alexandre

I wrote a similar answer on how to merge to dataframe using fuzzywuzzy: https://stackoverflow.com/a/68945895/15239951

Debug Try to debug manually:

for x in df1['FULL_NAME']:
    try:
        b = process.extractOne(x, df2["FULL_NAME"])[0]
        print(f"{x} <-> {b}")
    except TypeError:
        print(f"XXX Problem with '{x}'")

Output:

Louis <-> Mr Louis
XXX Problem with 'nan'
Alexandre <-> Mr Alexandre

Merge Use index [2] instead of best match name [0]:

df1 = pd.DataFrame({'FULL_NAME': ['Louis', np.nan, 'Alexandre'], 
                    'DATA1': [10, 20, 30]})

df2 = pd.DataFrame({'FULL_NAME': ['Mr Louis', 'Mr Paul', 'Mr Alexandre'],
                    'DATA2': [11, 21, 31]})

best_match_index = lambda x: process.extractOne(x,df2["FULL_NAME"])[2]
df1['BestMatch'] = df1["FULL_NAME"].dropna().map(best_match_index)

out = df2.merge(df1, left_index=True, right_on='BestMatch', how='left')

Output

>>> out
      FULL_NAME_x  DATA2 FULL_NAME_y  DATA1  BestMatch
0.0      Mr Louis     11       Louis   10.0          0
NaN       Mr Paul     21         NaN    NaN          1
2.0  Mr Alexandre     31   Alexandre   30.0          2