use fuzzywuzzy to compare 2 dataframe columns and change values based on matching ratio

Time:02-28

I have 2 dataframes:

import pandas as pd

df1 = pd.DataFrame(columns=['name', 'age'], data=[['Sam', 20], ['Dani', 21]])
df2 = pd.DataFrame(columns=['name', 'age'], data=[['Sami', 25], ['Danielle', 22]])

I want to check whether each name in df1 has a fuzzy match in df2, using the fuzzywuzzy library, and replace it with the matching df2['name'] value.

Here is my code:

from fuzzywuzzy import fuzz
df1.loc[fuzz.ratio(df1['name'],df2['name']) >= 90, 'name'] = df2['name']

So basically, if the matching ratio is 90 percent or higher, I want to change Sam to Sami. However, this code raises an error.

CodePudding user response:

If both DataFrames have the same index, you can use:

m = [fuzz.ratio(a, b) >= 90 for a, b in zip(df1['name'],df2['name'])]
df1.loc[m, 'name'] = df2['name']
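
For illustration, here is a minimal sketch of the row-by-row approach on the sample data. To keep it runnable without extra installs, it assumes a stand-in `ratio` helper built on the standard library's `difflib` (it is not part of fuzzywuzzy, and `fuzz.ratio` may score slightly differently); `replace_matches` is also a hypothetical name. The scorer compares two strings at a time, so it is called once per pair rather than on whole columns.

```python
from difflib import SequenceMatcher

import pandas as pd


def ratio(a, b):
    # Hypothetical stand-in for fuzz.ratio: a 0-100 similarity score for two strings.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def replace_matches(left, right, threshold):
    # Compare names row by row (both frames must share the same index).
    mask = pd.Series(
        [ratio(a, b) >= threshold for a, b in zip(left['name'], right['name'])],
        index=left.index,
    )
    out = left.copy()
    # Only rows whose score reaches the threshold take the name from `right`.
    out.loc[mask, 'name'] = right['name']
    return out


df1 = pd.DataFrame(columns=['name', 'age'], data=[['Sam', 20], ['Dani', 21]])
df2 = pd.DataFrame(columns=['name', 'age'], data=[['Sami', 25], ['Danielle', 22]])

scores = [ratio(a, b) for a, b in zip(df1['name'], df2['name'])]
strict = replace_matches(df1, df2, 90)
loose = replace_matches(df1, df2, 85)
```

With this stand-in scorer, Sam/Sami scores 86 and Dani/Danielle scores 67, so a threshold of 90 leaves both names unchanged, while 85 turns Sam into Sami.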

A more general solution, which works for DataFrames of any length, is to cross join both DataFrames, compute the ratio for every pair, and build a mapping Series from the highest-scoring match for each name:

from fuzzywuzzy import fuzz

df = df1.merge(df2, how='cross')
df['r'] = df.apply(lambda x: fuzz.ratio(x['name_x'],x['name_y']), axis=1)

s = df[df['r'] >= 90].sort_values('r', ascending=False).drop_duplicates(subset=['name_x']).set_index('name_x')['name_y']

df1['name'] = df1['name'].map(s).fillna(df1['name'])
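
As a rough sketch of how the cross-join mapping behaves when the two frames differ in length and order, the example below uses made-up data (the extra `Samuel` row is illustrative only) and a threshold of 85. The `ratio` helper is a `difflib`-based stand-in for `fuzz.ratio`, and `best_match_map` is a hypothetical name; `how='cross'` assumes pandas 1.2 or newer.

```python
from difflib import SequenceMatcher

import pandas as pd


def ratio(a, b):
    # Hypothetical stand-in for fuzz.ratio: a 0-100 similarity score for two strings.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match_map(left_names, right_names, threshold):
    # Build every (left, right) pair, score it, and keep the best pair per left name.
    pairs = left_names.to_frame('left').merge(right_names.to_frame('right'), how='cross')
    pairs['r'] = [ratio(a, b) for a, b in zip(pairs['left'], pairs['right'])]
    best = (
        pairs[pairs['r'] >= threshold]
        .sort_values('r', ascending=False)
        .drop_duplicates(subset='left')
    )
    return best.set_index('left')['right']


df1 = pd.DataFrame(columns=['name', 'age'], data=[['Sam', 20], ['Dani', 21]])
# Illustrative data: different length and order than df1.
df2 = pd.DataFrame(
    columns=['name', 'age'],
    data=[['Danielle', 22], ['Sami', 25], ['Samuel', 30]],
)

mapping = best_match_map(df1['name'], df2['name'], threshold=85)
# Names without a match above the threshold keep their original value.
new_names = df1['name'].map(mapping).fillna(df1['name'])
```

Here only Sam has a candidate at or above 85 (Sami), so the mapping has a single entry and Dani is left as it is.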

To improve performance, use rapidfuzz:

#https://stackoverflow.com/a/63884750/2901002
from rapidfuzz import process, utils

actual_comp = []
similarity = []

company_mapping = {company: utils.default_process(company) for company in df2.name}

for customer in df1.name:
    # with a dict of choices, extractOne returns (matched value, score, key)
    _, score, comp = process.extractOne(
        utils.default_process(customer),
        company_mapping,
        processor=None)
    actual_comp.append(comp)
    similarity.append(score)

df1['new'] = pd.Series(actual_comp, index=df1.index)
df1['similarity'] = pd.Series(similarity, index=df1.index)

# set values by similarity
df1.loc[df1['similarity'] >= 90, 'name'] = df1['new']

CodePudding user response:

You could try this: define a compare function and apply it to the name column.

from fuzzywuzzy import fuzz

def compare(name_from_df1):
    # return the first df2 name that scores at least 90, else keep the original
    for name in df2['name'].to_list():
        if fuzz.ratio(name_from_df1, name) >= 90:
            return name
    return name_from_df1

df1['name'] = df1['name'].apply(compare)
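
One thing to keep in mind with this loop is that it returns the first candidate that reaches the threshold, which is not necessarily the closest one. The sketch below contrasts that with a best-match variant. It is only an illustration: `ratio` is a `difflib`-based stand-in for `fuzz.ratio`, `first_match` and `best_match` are hypothetical names, and the candidate list and the threshold of 60 are made up to make the difference visible.

```python
from difflib import SequenceMatcher

import pandas as pd


def ratio(a, b):
    # Hypothetical stand-in for fuzz.ratio: a 0-100 similarity score for two strings.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def first_match(name, candidates, threshold):
    # Same idea as the loop above: stop at the first candidate that is good enough.
    for candidate in candidates:
        if ratio(name, candidate) >= threshold:
            return candidate
    return name


def best_match(name, candidates, threshold):
    # Pick the highest-scoring candidate, then check it against the threshold.
    best = max(candidates, key=lambda c: ratio(name, c), default=None)
    if best is not None and ratio(name, best) >= threshold:
        return best
    return name


# Illustrative candidates, ordered so the first acceptable match is not the best one.
candidates = ['Samuel', 'Sami', 'Danielle']
names = pd.Series(['Sam', 'Dani'])

first = names.apply(lambda n: first_match(n, candidates, 60))
best = names.apply(lambda n: best_match(n, candidates, 60))
```

With these inputs the first-match version maps Sam to Samuel, while the best-match version maps it to Sami. At a strict threshold such as 90 the two rarely differ, so the simpler loop is often fine.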