Finding similarity in a pandas variable

Time:01-24

I have a dataset with company names as follows:

{0: 'SEEO INC',
 1: 'BOSCH GMBH ROBERT',
 2: 'SAMSUNG SDI CO LTD',
 12: 'NAGAI TAKAYUKI',
 21: 'WESTPORT POWER INC',
 26: 'SAMSUNG ELECTRONICS CO LTD',
 27: 'SATO TOSHIO',
 28: 'SUMITOMO ELECTRIC INDUSTRIES',
 31: 'TOSHIBA KK',
 35: 'TEIKOKU SEIYAKU KK',
 46: 'MITSUBISHI ELECTRIC CORP',
 47: 'IHI CORP',
 49: 'WEI XI',
 53: 'SIEMENS AG',
 56: 'HYUNDAI MOTOR CO LTD',
 57: 'COOPER TECHNOLOGIES CO',
 58: 'TSUI CHENG-WEN',
 64: 'UCHICAGO ARGONNE LLC',
 68: 'BAYERISCHE MOTOREN WERKE AG',
 70: 'YAMAWA MFG CO LTD',
 71: 'YAMAWA MFG. CO., LTD.'}

The problem is that some of those names refer to the exact same firm but are written differently (e.g. with special symbols, as in 70 and 71, or with LIMITED rather than LTD, and many other variations that I cannot check by hand because there are 170,000 firms). I would of course like to give all of them the same name, and thought about this strategy:

  1. check the similarities of the variable firms (the one displayed) maybe with Louvain similarity;
  2. Give the name of the firm to the most similar strings

However, I am not aware of any pandas instrument to perform 1. and am not sure of how to catch the name of the firm in 2. (e.g. YAMAWA in the example above) if not by taking the first word and hoping that this is actually the name of the firm.

Could you please advise me on how to perform 1? Is there a way to deal with situations like mine?

Thank you

CodePudding user response:

Use fuzzywuzzy + combinations + defaultdict

Usually, you would want to use fuzzy matching between strings to achieve this.

  1. You can use fuzzywuzzy.fuzz.partial_ratio, or any other relevant fuzzy matching method, to compare 2 strings and see if their similarity score crosses a threshold.

  2. You can use itertools.combinations to iterate over every pair of items in the dictionary, so that each name is matched against all the others.

  3. You can use collections.defaultdict, and more specifically defaultdict(list), to "reduce" the key:value dictionary to a key:list_of_values dictionary whenever a given pair (point 2) passes the fuzzy matching condition (point 1).

NOTE: You will have to tune the threshold parameter to make sure you get the expected results on a larger dataset. threshold=80 works for this small example.
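Before any fuzzy matching, it may help to normalize the names so that purely cosmetic differences (punctuation, LIMITED vs LTD) disappear. The following is a minimal sketch using only the standard library; the normalize function and the SUFFIXES map are hypothetical and would need to be extended with the variants that actually occur in your data.

```python
import re

# Hypothetical suffix map: extend it with the variants found in your own data.
SUFFIXES = {
    "LIMITED": "LTD",
    "INCORPORATED": "INC",
    "CORPORATION": "CORP",
    "COMPANY": "CO",
}

def normalize(name):
    """Uppercase, replace punctuation with spaces, and unify common legal suffixes."""
    # Anything that is not an ASCII letter, digit, or space becomes a space.
    cleaned = re.sub(r"[^A-Z0-9 ]+", " ", name.upper())
    # split() also collapses the repeated spaces left behind by the substitution.
    return " ".join(SUFFIXES.get(token, token) for token in cleaned.split())

print(normalize("YAMAWA MFG. CO., LTD."))   # YAMAWA MFG CO LTD
print(normalize("Yamawa Mfg Co Limited"))   # YAMAWA MFG CO LTD
```

Names that become identical after this step need no fuzzy matching at all, which can make the threshold easier to tune. Note that the regular expression also drops non-ASCII letters, so it would have to be adapted if your names contain them.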

Here is the code for this -

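from collections import defaultdict
from itertools import combinations
from fuzzywuzzy import fuzz

# `master` is the {index: company name} dictionary shown in the question.
threshold = 80       # <---- Hyperparameter: minimum partial_ratio score (0-100)
d = defaultdict(list)
grouped = set()      # keys already attached to an earlier, similar name

for (i, ii), (j, jj) in combinations(master.items(), 2):
    if i in grouped:
        continue     # i already belongs to another group; do not start a new one
    if ii not in d[i]:
        d[i].append(ii)
    if fuzz.partial_ratio(ii, jj) >= threshold:
        grouped.add(j)
        if jj not in d[i]:
            d[i].append(jj)

# combinations() never yields the last item as `i`, so add it if it is still ungrouped.
for k, v in master.items():
    if k not in grouped and k not in d:
        d[k].append(v)

final = dict(d)
final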
{0: ['SEEO INC'],
 1: ['BOSCH GMBH ROBERT'],
 2: ['SAMSUNG SDI CO LTD'],
 12: ['NAGAI TAKAYUKI'],
 21: ['WESTPORT POWER INC'],
 26: ['SAMSUNG ELECTRONICS CO LTD'],
 27: ['SATO TOSHIO'],
 28: ['SUMITOMO ELECTRIC INDUSTRIES'],
 31: ['TOSHIBA KK'],
 35: ['TEIKOKU SEIYAKU KK'],
 46: ['MITSUBISHI ELECTRIC CORP'],
 47: ['IHI CORP'],
 49: ['WEI XI'],
 53: ['SIEMENS AG'],
 56: ['HYUNDAI MOTOR CO LTD'],
 57: ['COOPER TECHNOLOGIES CO'],
 58: ['TSUI CHENG-WEN'],
 64: ['UCHICAGO ARGONNE LLC'],
 68: ['BAYERISCHE MOTOREN WERKE AG'],
 70: ['YAMAWA MFG CO LTD', 'YAMAWA MFG. CO., LTD.']}
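For step 2 of the question (giving every variant the same name), one option is to turn the grouped dictionary into a lookup from each spelling to a single canonical spelling. This is only a sketch: the groups dictionary below repeats two entries of the output above, and taking the first spelling of each group as the canonical one is an assumption, not the only reasonable choice.

```python
# `groups` has the same shape as the grouped output above; only two entries are repeated here.
groups = {
    68: ["BAYERISCHE MOTOREN WERKE AG"],
    70: ["YAMAWA MFG CO LTD", "YAMAWA MFG. CO., LTD."],
}

# Assumption: the first spelling in each group serves as the canonical name.
canonical = {
    variant: variants[0]
    for variants in groups.values()
    for variant in variants
}

print(canonical["YAMAWA MFG. CO., LTD."])   # YAMAWA MFG CO LTD
```

With the names in a pandas column, something like df["firms"].map(canonical) should then rewrite every variant in one pass (df and the column name firms are placeholders for your own data).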

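If you just want to remove the "partially duplicated" instances, instead of combining them in a dict with list values as above, then you can skip the use of collections.defaultdict and work directly with a dictionary. When you find another instance of a string similar to an earlier one, you record its key as a duplicate and leave it out of the result.

Here is the code for that -

from itertools import combinations
from fuzzywuzzy import fuzz

# `master` is the {index: company name} dictionary shown in the question.
threshold = 80       # <---- Hyperparameter
duplicates = set()   # keys whose name is similar to an earlier name

for (i, ii), (j, jj) in combinations(master.items(), 2):
    if i in duplicates:
        continue     # i is itself a duplicate; do not compare from it
    if fuzz.partial_ratio(ii, jj) >= threshold:
        duplicates.add(j)

# Keep only the first occurrence of each group of similar names.
final = {k: v for k, v in master.items() if k not in duplicates}
final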
{0: 'SEEO INC',
 1: 'BOSCH GMBH ROBERT',
 2: 'SAMSUNG SDI CO LTD',
 12: 'NAGAI TAKAYUKI',
 21: 'WESTPORT POWER INC',
 26: 'SAMSUNG ELECTRONICS CO LTD',
 27: 'SATO TOSHIO',
 28: 'SUMITOMO ELECTRIC INDUSTRIES',
 31: 'TOSHIBA KK',
 35: 'TEIKOKU SEIYAKU KK',
 46: 'MITSUBISHI ELECTRIC CORP',
 47: 'IHI CORP',
 49: 'WEI XI',
 53: 'SIEMENS AG',
 56: 'HYUNDAI MOTOR CO LTD',
 57: 'COOPER TECHNOLOGIES CO',
 58: 'TSUI CHENG-WEN',
 64: 'UCHICAGO ARGONNE LLC',
 68: 'BAYERISCHE MOTOREN WERKE AG',
 70: 'YAMAWA MFG CO LTD'}
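A note on scale: comparing every pair among 170,000 names means roughly 14.4 billion comparisons (170,000 × 169,999 / 2), which is likely too slow for a plain Python loop. A common workaround is "blocking": only compare names that share some cheap key. The sketch below blocks on the first word and uses difflib.SequenceMatcher from the standard library as a stand-in for a fuzzywuzzy scorer; candidate_pairs is a hypothetical helper, and the first-word key is an assumption that will miss variants such as a reordered name.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def candidate_pairs(names):
    """Yield pairs of names that share a first word, instead of every possible pair."""
    blocks = defaultdict(list)
    for name in names:
        tokens = name.split()
        if tokens:
            blocks[tokens[0]].append(name)
    for block in blocks.values():
        yield from combinations(block, 2)

names = [
    "SAMSUNG SDI CO LTD",
    "SAMSUNG ELECTRONICS CO LTD",
    "SIEMENS AG",
    "YAMAWA MFG CO LTD",
    "YAMAWA MFG. CO., LTD.",
]

pairs = list(candidate_pairs(names))
print(len(pairs))   # 2 candidate pairs instead of the 10 possible pairs

for a, b in pairs:
    # ratio() returns a similarity between 0 and 1; it is not the same scale as partial_ratio.
    score = SequenceMatcher(None, a, b).ratio()
    print(a, "|", b, round(score, 2))
```

The same threshold tuning applies here: the scores only say which pairs are worth a closer look, and the cut-off still has to be checked against a sample of your real data.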