Efficient way of generating new columns having string similarity distances between two string column


I have a pandas dataframe with shape (1138812, 14) and columns

['id', 'name', 'latitude', 'longitude', 'address', 'city', 'state',
       'zip', 'country', 'url', 'phone', 'categories', 'point_of_interest',
       'id_2', 'name_2', 'latitude_2', 'longitude_2', 'address_2', 'city_2',
       'state_2', 'zip_2', 'country_2', 'url_2', 'phone_2', 'categories_2',
       'point_of_interest_2', 'match']

I want to create new columns holding string-similarity distances, computed with difflib.SequenceMatcher().ratio(), Levenshtein.distance(), Levenshtein.jaro_winkler() and LongestCommonSubstring(), between each of the columns

['name', 'address', 'city', 'state',
       'zip', 'country', 'url', 'phone', 'categories']

and the corresponding _2-suffixed columns. In the end that gives me 9*4 = 36 new columns. Right now I loop through the dataframe with df.iterrows() and build column lists, but it is extremely time- and memory-consuming: it takes 3.5 hours to go through the whole dataframe and uses the full 16 GB of RAM. I am trying to find a method that is better both time- and memory-wise. My code:

import pandas as pd
import Levenshtein
import difflib
from tqdm.notebook import tqdm
columns = ['name', 'address', 'city', 'state',
           'zip', 'country', 'url', 'phone', 'categories']
data_dict = {}
for i in columns:
    data_dict[f"{i}_geshs"] = []
    data_dict[f"{i}_levens"] = []
    data_dict[f"{i}_jaros"] = []
    data_dict[f"{i}_lcss"] = []
for i,row in tqdm(train.iterrows(),total = train.shape[0]):
    for j in columns:
        data_dict[f"{j}_geshs"].append(difflib.SequenceMatcher(None, row[j], row[f"{j}_2"]).ratio())
        data_dict[f"{j}_levens"].append(Levenshtein.distance(row[j], row[f"{j}_2"]))
        data_dict[f"{j}_jaros"].append(Levenshtein.jaro_winkler(row[j], row[f"{j}_2"]))
        data_dict[f"{j}_lcss"].append(LCS(str(row[j]), str(row[f"{j}_2"])))
data = pd.DataFrame(data_dict)
train = pd.concat([train, data], axis=1)
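
The LCS() used above is a LongestCommonSubstring helper that isn't shown in the question. A minimal sketch of one possible implementation, assuming it should return the length of the longest common substring, using difflib:

import difflib

def LCS(a, b):
    # Length of the longest contiguous matching block, i.e. the longest common substring.
    m = difflib.SequenceMatcher(None, a, b)
    return m.find_longest_match(0, len(a), 0, len(b)).size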

CodePudding user response:

Starting with a dataframe that looks like:

first_name | address | city | state | zip | url | phone | categories | first_name_2 | address_2 | city_2 | state_2 | zip_2 | url_2 | phone_2 | categories_2
Rori | 680 Buell Crossing | Dallas | Texas | 75277 | url_shortened | 214-533-2179 | Granite Surfaces | Agustin | 7 Schiller Crossing | Lubbock | Texas | 79410 | url_shortened | 806-729-7419 | Roofing (Metal)
Dmitri | 05 Coolidge Way | Charleston | West Virginia | 25356 | url_shortened | 304-906-6384 | Structural and Misc Steel (Fabrication) | Kearney | 0547 Clemons Plaza | Peoria | Illinois | 61651 | url_shortened | 309-326-4252 | Framing (Steel)

And is of shape 1024000 rows × 16 columns

import difflib
import Levenshtein
import numpy as np
import pandas as pd
from pandarallel import pandarallel
pandarallel.initialize(nb_workers=8) # Customize based on # of cores, or leave blank to use all

def dists(x, y):
    matcher = difflib.SequenceMatcher(None, x, y)
    geshs = matcher.ratio()
    levens = Levenshtein.distance(x, y)
    jaros = Levenshtein.jaro_winkler(x, y)
    lcss = matcher.find_longest_match(0, len(x), 0, len(y)).size # longest-common-substring length; I wasn't sure how you'd done this one.
    return [geshs, levens, jaros, lcss]


df = pd.read_csv('MOCK_DATA.csv')
df = df.astype(str) # force all fields to strings.

cols = df.columns
cols = np.array_split(cols, 2) # assumes there's a matching `_2` column for every column.
for x, y in zip(*cols):
    new_cols = [x + '_geshs', x + '_levens', x + '_jaros', x + '_lcss']
    # Assign all four distance columns at once from the expanded apply result.
    df[new_cols] = df.parallel_apply(lambda z: dists(z[x], z[y]), axis=1, result_type='expand')
    # Replace parallel_apply with apply to run non-parallel.

In addition to keeping the original columns, I get the new columns below in about 3 minutes; without parallelization it would probably still only take ~20-30 minutes. Peak memory usage from Python was only about 3 GB, and it would be much lower without parallelization.

first_name_geshs, first_name_levens, first_name_jaros, first_name_lcss,
address_geshs, address_levens, address_jaros, address_lcss,
city_geshs, city_levens, city_jaros, city_lcss,
state_geshs, state_levens, state_jaros, state_lcss,
zip_geshs, zip_levens, zip_jaros, zip_lcss,
url_geshs, url_levens, url_jaros, url_lcss,
phone_geshs, phone_levens, phone_jaros, phone_lcss,
categories_geshs, categories_levens, categories_jaros, categories_lcss
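
As an aside, if you want to skip the per-row apply overhead (and the pandarallel dependency), a plain loop over the paired columns with a list comprehension is usually competitive. This is only a sketch that reuses the dists helper and the cols split from the code above:

frames = []
for x, y in zip(*cols):
    # Compute all four distances for one column pair in a single pass over the values.
    vals = [dists(a, b) for a, b in zip(df[x], df[y])]
    frames.append(pd.DataFrame(
        vals,
        columns=[f"{x}_geshs", f"{x}_levens", f"{x}_jaros", f"{x}_lcss"],
        index=df.index,
    ))
df = df.join(frames)  # attach all new columns by index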