I have a set of legacy data in which companies have been matched together (based on a number of different factors). However, I would like to do a bit of QA on the matches based just on the names that came from the different profiles (which may have small differences).
Assuming a dataset as follows:
Match ID | Name Left | Name Right
---|---|---
1 | LemonFarms | Lemon Farms Inc
2 | Peachtree | PeachTree Farms
3 | Tomato Grove | Orange Cheetah Farm
Is there an algorithmic or Pythonic way to know, for example, that 1 and 2 are probably well matched and 3 is badly matched? This without any ML, etc.
I would imagine I need to create a score somehow and apply a cut-off to it. However, I would love some input on whether there is a best practice.
Some ideas I have had are to look for patterns, compare letters as sets, etc. However, I am at a loss as to how to execute them.
CodePudding user response:
You can try fuzzywuzzy to get a similarity score; then you just need to set a score threshold as the cut-off.
from fuzzywuzzy import fuzz
# Score each name pair with partial_ratio (0-100, higher = more similar)
df['score'] = df[['Name Left','Name Right']].apply(lambda x: fuzz.partial_ratio(*x), axis=1)
df
Out[134]:
Match ID Name Left Name Right score
0 1 LemonFarms Lemon Farms Inc 90
1 2 Peachtree PeachTree Farms 89
2 3 Tomato Grove Orange Cheetah Farm 13
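If installing fuzzywuzzy is not an option, the standard library's difflib offers a comparable similarity ratio. A minimal sketch of the scoring-plus-cut-off idea (the 0.7 threshold and the lower-casing are assumptions you would tune against your own data):

```python
from difflib import SequenceMatcher

def name_score(left, right):
    """Similarity ratio in [0, 1]; higher means more alike."""
    return SequenceMatcher(None, left.lower(), right.lower()).ratio()

pairs = [("LemonFarms", "Lemon Farms Inc"),
         ("Peachtree", "PeachTree Farms"),
         ("Tomato Grove", "Orange Cheetah Farm")]

for left, right in pairs:
    score = name_score(left, right)
    # Treat anything at or above the (assumed) 0.7 cut-off as a good match
    print(left, "|", right, "|", round(score, 2), "|",
          "match" if score >= 0.7 else "no match")
```

SequenceMatcher is slower than fuzzywuzzy's C backend on large frames, but it needs no extra dependency.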
CodePudding user response:
Is there an algorithmic or Pythonic way to know, for example, that 1 and 2 are probably well matched and 3 is badly matched? This without any ML, etc.
For computing the similarity between two strings you might use the Levenshtein distance. As far as I know it is not available in pandas, so you would need another library for this task. I might suggest pylev if you want ease of use, but you might elect to use another one, especially if you find it desirable to use different costs for different replacements (the so-called weighted Levenshtein distance).
Using your sample data:
import pandas as pd
import pylev
df = pd.DataFrame({"nameleft":["LemonFarms","Peachtree","Tomato Grove"],"nameright":["Lemon Farms Inc","PeachTree Farms","Orange Cheetah Farm"]})
df["distance"] = df.apply(lambda x:pylev.levenshtein(x.nameleft,x.nameright),axis=1)
print(df)
output
nameleft nameright distance
0 LemonFarms Lemon Farms Inc 5
1 Peachtree PeachTree Farms 7
2 Tomato Grove Orange Cheetah Farm 18
You would need to select a cut-off, i.e. a value above which names are to be treated as different.
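One caveat when picking that cut-off: a raw edit distance grows with string length, so a common refinement (my suggestion, not part of the answer above) is to normalise it by the longer string's length before comparing against a threshold:

```python
def normalised_distance(left, right, distance):
    """Scale a raw edit distance into [0, 1]; 0 means identical strings."""
    return distance / max(len(left), len(right), 1)

# Using the raw distances from the table above
print(normalised_distance("LemonFarms", "Lemon Farms Inc", 5))         # 5/15 ≈ 0.33
print(normalised_distance("Tomato Grove", "Orange Cheetah Farm", 18))  # 18/19 ≈ 0.95
```

A single cut-off (say 0.5) on the normalised value then behaves consistently for both short and long company names.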
The above assumes you are allowed to use external libraries as you wish. If this is not the case, you would probably need to implement the Levenshtein distance function yourself. If you only need the equal-cost (basic) Levenshtein distance, this should not be a great problem; you can find a few Python implementations on Rosetta Code.
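For reference, the textbook dynamic-programming version is short. A sketch of the standard two-row formulation (my own, not taken from Rosetta Code):

```python
def levenshtein(a: str, b: str) -> int:
    """Basic equal-cost Levenshtein distance via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner loop over the shorter string
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ca != cb)))    # substitution
        previous = current
    return previous[-1]

print(levenshtein("LemonFarms", "Lemon Farms Inc"))  # 5, matching pylev above
```

It keeps only two rows of the DP table, so memory is O(min(len(a), len(b))) while time stays O(len(a) * len(b)).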
CodePudding user response:
Levenshtein Distance
I am not an expert on this by any means, but I have heard of the Levenshtein distance before.
...the Levenshtein distance is a string metric for measuring the difference between two sequences.
The distance itself counts edits, so a higher distance means the strings are more different; fuzzywuzzy converts it into a similarity ratio from 0 to 100, where a higher number means the two strings are more similar. You can use it in Python with the package fuzzywuzzy.
pip install fuzzywuzzy
Example:
from fuzzywuzzy import fuzz
n1 = "LemonFarms"
n2 = "Lemon Farms Inc"
m1 = "Tomato Grove"
m2 = "Orange Cheetah Farm"
print(fuzz.ratio(n1, n2))
print(fuzz.ratio(m1, m2))
This would return 80 for the first row of your table and 6 for the third row. Play around with this and let me know if you find something that satisfies you.
CodePudding user response:
The Levenshtein module implements the well-known Levenshtein fuzzy-matching algorithm along with other related algorithms (e.g. Jaro, Jaro-Winkler).
Levenshtein.jaro_winkler() is a string similarity metric that gives more weight to a common prefix, as spelling mistakes are more likely to occur near ends of words. It returns a numeric value from 0.0 to 1.0 where 1.0 is the most similar.
Levenshtein.distance() calculates the minimum number of insertions, deletions, and substitutions required to change one sequence into the other.
import pandas as pd
import Levenshtein
data = [[1, 'LemonFarms', 'Lemon Farms Inc'],
[2, 'Peachtree', 'PeachTree Farms'],
[3, 'Tomato Grove', 'Orange Cheetah Far']
]
df = pd.DataFrame(data, columns=['Match ID', 'Name Left', 'Name Right'])
df['score'] = df[['Name Left', 'Name Right']].apply(lambda x: Levenshtein.jaro_winkler(*x), axis=1)
df['distance'] = df[['Name Left', 'Name Right']].apply(lambda x: Levenshtein.distance(*x), axis=1)
print(df)
Output:
Match ID Name Left Name Right score distance
0 1 LemonFarms Lemon Farms Inc 0.933333 5
1 2 Peachtree PeachTree Farms 0.884444 7
2 3 Tomato Grove Orange Cheetah Far 0.431481 17
To make the comparison case-insensitive (which improves the score for pairs like Peachtree vs PeachTree), convert the values to lower case before calculating the similarity:
df['score'] = df[['Name Left', 'Name Right']].apply(
lambda x: Levenshtein.jaro_winkler(x[0].lower(), x[1].lower()), axis=1)
This changes the score of Peachtree vs PeachTree Farms from 0.884444 to 0.920000. You can use a cut-off threshold; e.g. score >= 0.9 to indicate that strings are well matched and < 0.9 that they are badly matched.
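A sketch of turning that cut-off into a flag column (the `well_matched` column name and the 0.9 threshold are assumptions to tune; the scores are the ones printed above):

```python
import pandas as pd

# Jaro-Winkler scores as computed in the output above
df = pd.DataFrame({
    "Match ID": [1, 2, 3],
    "score": [0.933333, 0.920000, 0.431481],
})
# Rows at or above the cut-off are treated as well matched
df["well_matched"] = df["score"] >= 0.9
print(df)
```

The resulting boolean column makes it easy to filter the frame for rows that need manual review (`df[~df["well_matched"]]`).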