I have a set of legacy data in which companies have been matched together (based on a number of different factors). However, I would like to do a bit of QA on the matches based just on the names that came from the different profiles (which may have small differences).
Assuming a dataset as follows:
Match ID | Name Left | Name Right
---|---|---
1 | LemonFarms | Lemon Farms Inc
2 | Peachtree | PeachTree Farms
3 | Tomato Grove | Orange Cheetah Farm
Is there an algorithmic or Pythonic way to know, for example, that 1 and 2 are probably well matched and 3 is badly matched? This without any ML, etc.
I would imagine I need to create a score somehow and apply a cut-off to it. However, I would love some input on whether there is a best practice.
Some ideas I have had are to look for patterns, compare letters as sets, etc. However, I am at a loss as to how to execute them.
CodePudding user response:
You can try fuzzywuzzy to get a similarity score; then you just need to set a score threshold as the cut-off.
from fuzzywuzzy import fuzz
# Score each name pair with partial_ratio (0-100, higher = more similar)
df['score'] = df[['Name Left','Name Right']].apply(lambda x: fuzz.partial_ratio(*x), axis=1)
df
Out[134]:
Match ID Name Left Name Right score
0 1 LemonFarms Lemon Farms Inc 90
1 2 Peachtree PeachTree Farms 89
2 3 Tomato Grove Orange Cheetah Farm 13
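If installing fuzzywuzzy is not an option, the standard library's difflib offers a comparable similarity ratio. A minimal sketch of the scoring-plus-cut-off idea (the 0.7 threshold and the lower-casing are assumptions you would tune against your own data):

```python
from difflib import SequenceMatcher

def name_score(left, right):
    """Similarity ratio in [0, 1]; higher means more alike."""
    return SequenceMatcher(None, left.lower(), right.lower()).ratio()

pairs = [("LemonFarms", "Lemon Farms Inc"),
         ("Peachtree", "PeachTree Farms"),
         ("Tomato Grove", "Orange Cheetah Farm")]

for left, right in pairs:
    score = name_score(left, right)
    # Treat anything at or above the (assumed) 0.7 cut-off as a good match
    print(left, "|", right, "|", round(score, 2), "|",
          "match" if score >= 0.7 else "no match")
```

SequenceMatcher is slower than fuzzywuzzy's C backend on large frames, but it needs no extra dependency.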
CodePudding user response:
Is there an algorithmic or Pythonic way to know, for example, that 1 and 2 are probably well matched and 3 is badly matched? This without any ML, etc.
For computing the similarity between two strings you might use the Levenshtein distance. As far as I know it is not available in pandas, so you would need another library for this task. I might suggest pylev if you want ease of use, but you might elect to use another one, especially if you find it desirable to use different costs for different replacements (the so-called weighted Levenshtein distance).
Using your sample data:
import pandas as pd
import pylev
df = pd.DataFrame({"nameleft":["LemonFarms","Peachtree","Tomato Grove"],"nameright":["Lemon Farms Inc","PeachTree Farms","Orange Cheetah Farm"]})
df["distance"] = df.apply(lambda x:pylev.levenshtein(x.nameleft,x.nameright),axis=1)
print(df)
output
nameleft nameright distance
0 LemonFarms Lemon Farms Inc 5
1 Peachtree PeachTree Farms 7
2 Tomato Grove Orange Cheetah Farm 18
You would need to select a cut-off, i.e. a value above which names are to be treated as different.
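One caveat when picking that cut-off: a raw edit distance grows with string length, so a common refinement (my suggestion, not part of the answer above) is to normalise it by the longer string's length before comparing against a threshold:

```python
def normalised_distance(left, right, distance):
    """Scale a raw edit distance into [0, 1]; 0 means identical strings."""
    return distance / max(len(left), len(right), 1)

# Using the raw distances from the table above
print(normalised_distance("LemonFarms", "Lemon Farms Inc", 5))         # 5/15 ≈ 0.33
print(normalised_distance("Tomato Grove", "Orange Cheetah Farm", 18))  # 18/19 ≈ 0.95
```

A single cut-off (say 0.5) on the normalised value then behaves consistently for both short and long company names.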
The above assumes you are allowed to use external libraries as you wish. If this is not the case, you would probably need to implement the Levenshtein distance function yourself. If you only need the equal-cost (basic) Levenshtein distance, this should not be a great problem; you can find a few Python implementations on Rosetta Code.
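For reference, the textbook dynamic-programming version is short. A sketch of the standard two-row formulation (my own, not taken from Rosetta Code):

```python
def levenshtein(a: str, b: str) -> int:
    """Basic equal-cost Levenshtein distance via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner loop over the shorter string
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ca != cb)))    # substitution
        previous = current
    return previous[-1]

print(levenshtein("LemonFarms", "Lemon Farms Inc"))  # 5, matching pylev above
```

It keeps only two rows of the DP table, so memory is O(min(len(a), len(b))) while time stays O(len(a) * len(b)).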
CodePudding user response:
Levenshtein Distance
I am not an expert on this by any means, but I have heard of the Levenshtein distance before.
...the Levenshtein distance is a string metric for measuring the difference between two sequences.
The distance itself counts edits, so a higher distance means the strings are more different; fuzzywuzzy converts it into a similarity ratio from 0 to 100, where a higher number means the two strings are more similar. You can use it in Python with the package fuzzywuzzy.
pip install fuzzywuzzy
Example:
from fuzzywuzzy import fuzz
n1 = "LemonFarms"
n2 = "Lemon Farms Inc"
m1 = "Tomato Grove"
m2 = "Orange Cheetah Farm"
print(fuzz.ratio(n1, n2))
print(fuzz.ratio(m1, m2))
This would return 80 for the first row of your table and 6 for the third row. Play around with this and let me know if you find something that satisfies you.
CodePudding user response:
The Levenshtein module implements the well-known Levenshtein fuzzy-matching algorithm along with other related algorithms (e.g. Jaro, Jaro-Winkler).
Levenshtein.jaro_winkler() is a string similarity metric that gives more weight to a common prefix, as spelling mistakes are more likely to occur near ends of words. It returns a numeric value from 0.0 to 1.0 where 1.0 is the most similar.
Levenshtein.distance() calculates the minimum number of insertions, deletions, and substitutions required to change one sequence into the other.
import pandas as pd
import Levenshtein
data = [[1, 'LemonFarms', 'Lemon Farms Inc'],
[2, 'Peachtree', 'PeachTree Farms'],
[3, 'Tomato Grove', 'Orange Cheetah Far']
]
df = pd.DataFrame(data, columns=['Match ID', 'Name Left', 'Name Right'])
df['score'] = df[['Name Left', 'Name Right']].apply(lambda x: Levenshtein.jaro_winkler(*x), axis=1)
df['distance'] = df[['Name Left', 'Name Right']].apply(lambda x: Levenshtein.distance(*x), axis=1)
print(df)
Output:
Match ID Name Left Name Right score distance
0 1 LemonFarms Lemon Farms Inc 0.933333 5
1 2 Peachtree PeachTree Farms 0.884444 7
2 3 Tomato Grove Orange Cheetah Far 0.431481 17
To make the comparison case-insensitive (which improves the score for pairs like Peachtree vs PeachTree), convert the values to lower case before calculating the similarity:
df['score'] = df[['Name Left', 'Name Right']].apply(
lambda x: Levenshtein.jaro_winkler(x[0].lower(), x[1].lower()), axis=1)
This changes the score of Peachtree vs PeachTree Farms from 0.884444 to 0.920000. You can use a cut-off threshold; e.g. score >= 0.9 to indicate that strings are well matched and < 0.9 that they are badly matched.
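A sketch of turning that cut-off into a flag column (the `well_matched` column name and the 0.9 threshold are assumptions to tune; the scores are the ones printed above):

```python
import pandas as pd

# Jaro-Winkler scores as computed in the output above
df = pd.DataFrame({
    "Match ID": [1, 2, 3],
    "score": [0.933333, 0.920000, 0.431481],
})
# Rows at or above the cut-off are treated as well matched
df["well_matched"] = df["score"] >= 0.9
print(df)
```

The resulting boolean column makes it easy to filter the frame for rows that need manual review (`df[~df["well_matched"]]`).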