How to find the string match between 2 Excel files and return the match percentage in python?-CodePudding

I have 2 Excel files which contains names as the only column:

File 1: file1.xlsx

Names
Vinay adkz
Sagarbhansali
Jeffery Jonas
Kiara Francis
Dominic


File 2: file2.xlsx

Names:
bhansali Sagar
Dominic
Jenny
adkzVinay

Sample Output:
I want to match the names in file 1 with names in file 2, and i am trying to get an output like the below :

Names         File2Matchname.  Match%
Vinay adkz.     adkzVinay.       98%
Sagarbhansali.  bhansali sagar   97%
Jeffery Jonas       NA            0%
Kiara Francis       NA            0%
Dominic          Dominic         100%

Is there any logic by which the above logic can be arrived in python ?

I tried to do this in Excel but vlookup doesn't help with match%. I know this is possible with python using cousine similarity but i am unable to get the logic in which the output can be arrived.

Any help would be much appreciated.

CodePudding user response：

You can use Pandas and use python's built-in difflib library which has a function called difflib.SequenceMatcher() function to find the longest common substring between the two names.

Example code:

import pandas as pd
import difflib

#For testing
dict_lists = {"Names":["Vinay adkz", "Shailesh", "Seema", "Nidhi","Ajitesh"]}
dict_lists2 = {"Names":["Ajitesh", "Vinay adkz", "Seema", "Nid"]}

# Excel to  dataframes
df1 = pd.DataFrame(dict_lists) #pd.read_excel('file1.xlsx')
df2 = pd.DataFrame(dict_lists2) #pd.read_excel('file2.xlsx')

# Empty lists to stor matched name, match percentage
match_name = []
match_percent = []

# Iterate through the first dataframe
for i, row in df1.iterrows():
    name = row['Names']
    match = difflib.get_close_matches(name, df2['Names'], n=1, cutoff=0.8)
    if match:
        match_name.append(match[0])
        match_string = difflib.SequenceMatcher(None, name, match[0]).find_longest_match(0, len(name), 0, len(match[0]))
        match_percentage = (match_string.size / len(name)) * 100
        match_percent.append(match_percentage)
    else:
        match_name.append('NA')
        match_percent.append(0)
      

df1['File2names'] = match_name
df1['Match_per'] = match_percent
print(df1)

# Write in Excel
# df1.to_excel('output.xlsx', index=False)

I hope this helps you. This is the first time I am answering a question here. Read also: How to use SequenceMatcher to find similarity between two strings?