How to create a name similarity column?-CodePudding

I've a sample dataframe

pid = [1,2,3,4,5]; name = ['abc', 'def', 'bca', 'fed', 'pqr']; match_score = [np.nan, np.nan, np.nan, np.nan, np.nan]

sample_df = pd.DataFrame(zip(pid,name,match_score), columns=['pid', 'name', 'match_score'])
sample_df

pid	name	match_score
1	abc	NaN
2	def	NaN
3	bca	NaN
4	fed	NaN
5	pqr	NaN

And there's a name similarity score calculation method

from difflib import SequenceMatcher
SequenceMatcher(None, "abc", "bca").ratio()
>>> 0.666

How can I apply SequenceMatcher method to each row in the sample_df, so that I get

from difflib import SequenceMatcher

# comparing row1 with row2
print(SequenceMatcher(None, "abc", "def").ratio())
>>> 0.0
# comparing row1 with row3
print(SequenceMatcher(None, "abc", "bca").ratio())
>>> 0.66
# comparing row1 with row4
print(SequenceMatcher(None, "abc", "fed").ratio())
>>> 0.0
# comparing row1 with row5
print(SequenceMatcher(None, "abc", "pqr").ratio())
>>> 0.0
# Highest score for abc was 6.666

pid	name	match_score
1	abc	0.666
2	def	NaN
3	bca	NaN
4	fed	NaN
5	pqr	NaN

# comparing row2 with row1
print(SequenceMatcher(None, "def", "abc").ratio())
>>> 0.0
# comparing row2 with row3
print(SequenceMatcher(None, "def", "bca").ratio())
>>> 0.0
# comparing row2 with row4
print(SequenceMatcher(None, "def", "fed").ratio())
>>> 0.33
# comparing row2 with row5
print(SequenceMatcher(None, "def", "pqr").ratio())
>>> 0.0
# Highest score for def was 3.333

pid	name	match_score
1	abc	0.666
2	def	0.33
3	bca	NaN
4	fed	NaN
5	pqr	NaN

And so on:

pid	name	match_score
1	abc	0.666
2	def	0.333
3	bca	0.666
4	fed	0.333
5	pqr	0.000

CodePudding user response：

managed to apply fuzzywuzzy to get the result you need.

I also just started, - so my approach is most likely not the best, but works for the data you presented:

import pandas as pd
import numpy as np
from fuzzywuzzy import process, fuzz

pid = [1,2,3,4,5]; name = ['abc', 'def', 'bca', 'fed', 'pqr']; match_score = [np.nan, np.nan, np.nan, np.nan, np.nan]

sample_df = pd.DataFrame(zip(pid,name,match_score), columns=['pid', 'name', 'match_score'])
sample_df.drop('match_score', axis=1, inplace=True) # droping col as it will be created later.

unique_names = sample_df['name'].unique().tolist()
match_score = [(x,)   i
             for x in unique_names 
             for i in process.extract(x, unique_names,     scorer=fuzz.token_sort_ratio)]
             
similarity_df = pd.DataFrame(match_score, columns=['name','name_compare','match_score'])
similarity_df = similarity_df[similarity_df['match_score'] !=0].copy()
similarity_df = similarity_df[similarity_df['match_score'] !=100].drop('name_compare', axis=1)
sample_df= sample_df.merge(similarity_df, left_on='name', right_on='name', how="outer")
sample_df.match_score = sample_df.match_score / 100
print(sample_df)

Out:

   pid name  match_score
0    1  abc         0.67
1    2  def         0.33
2    3  bca         0.67
3    4  fed         0.33
4    5  pqr          NaN

CodePudding user response：

I am running two loops: outer and inner. Sorry cannot comment. As I go on to comment code I get error for indentation in python.

I set a max value 0. I match it with ratio value calculated. I also match if the strings compared aren't same. If both check (string comparison and value against max) are true I assign it to the match_score column using loc

length_df=len(sample_df)
for outer_index in range(0, length_df):
    max=0
    for inner_index in range(0, length_df):
        out_value=sample_df.iloc[outer_index]['name']
        inn_value=sample_df.iloc[inner_index]['name']
        value_ratio=SequenceMatcher(None,out_value,inn_value).ratio()
        if (out_value!=inn_value) & (value_ratio >max):
            
            sample_df.loc[outer_index,'match_score']=value_ratio
            
            ##
        ##
    ##
##