Efficient way of generating new columns having string similarity distances between two string column


I have a pandas dataframe with shape (1138812, 14) and columns

['id', 'name', 'latitude', 'longitude', 'address', 'city', 'state',
       'zip', 'country', 'url', 'phone', 'categories', 'point_of_interest',
       'id_2', 'name_2', 'latitude_2', 'longitude_2', 'address_2', 'city_2',
       'state_2', 'zip_2', 'country_2', 'url_2', 'phone_2', 'categories_2',
       'point_of_interest_2', 'match']

I want to create new columns holding string-similarity distances, computed with difflib.SequenceMatcher().ratio(), Levenshtein.distance(), Levenshtein.jaro_winkler() and LongestCommonSubstring(), between each of the columns

['name', 'address', 'city', 'state',
       'zip', 'country', 'url', 'phone', 'categories']

and the corresponding _2-suffixed columns. In the end that gives me 9*4 = 36 new columns. Right now I loop through the dataframe with df.iterrows() and build column lists, but it is extremely time- and memory-consuming: it takes 3.5 hours to go through the whole dataframe and uses the full 16 GB of RAM. I am trying to find a method that is better both time- and memory-wise. My code:

import pandas as pd
import Levenshtein
import difflib
from tqdm.notebook import tqdm
columns = ['name', 'address', 'city', 'state',
           'zip', 'country', 'url', 'phone', 'categories']
data_dict = {}
for i in columns:
    data_dict[f"{i}_geshs"] = []
    data_dict[f"{i}_levens"] = []
    data_dict[f"{i}_jaros"] = []
    data_dict[f"{i}_lcss"] = []
for i,row in tqdm(train.iterrows(),total = train.shape[0]):
    for j in columns:
        data_dict[f"{j}_geshs"].append(difflib.SequenceMatcher(None, row[j], row[f"{j}_2"]).ratio())
        data_dict[f"{j}_levens"].append(Levenshtein.distance(row[j], row[f"{j}_2"]))
        data_dict[f"{j}_jaros"].append(Levenshtein.jaro_winkler(row[j], row[f"{j}_2"]))
        data_dict[f"{j}_lcss"].append(LCS(str(row[j]), str(row[f"{j}_2"])))
data = pd.DataFrame(data_dict)
train = pd.concat([train, data], axis=1)
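
The LCS() used above is a LongestCommonSubstring helper that isn't shown in the question. A minimal sketch of one possible implementation, assuming it should return the length of the longest common substring, using difflib:

import difflib

def LCS(a, b):
    # Length of the longest contiguous matching block, i.e. the longest common substring.
    m = difflib.SequenceMatcher(None, a, b)
    return m.find_longest_match(0, len(a), 0, len(b)).size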

CodePudding user response:

Starting with a dataframe that looks like:

first_name | address | city | state | zip | url | phone | categories | first_name_2 | address_2 | city_2 | state_2 | zip_2 | url_2 | phone_2 | categories_2
Rori | 680 Buell Crossing | Dallas | Texas | 75277 | url_shortened | 214-533-2179 | Granite Surfaces | Agustin | 7 Schiller Crossing | Lubbock | Texas | 79410 | url_shortened | 806-729-7419 | Roofing (Metal)
Dmitri | 05 Coolidge Way | Charleston | West Virginia | 25356 | url_shortened | 304-906-6384 | Structural and Misc Steel (Fabrication) | Kearney | 0547 Clemons Plaza | Peoria | Illinois | 61651 | url_shortened | 309-326-4252 | Framing (Steel)

And is of shape 1024000 rows × 16 columns

import difflib
import Levenshtein
import numpy as np
import pandas as pd
from pandarallel import pandarallel
pandarallel.initialize(nb_workers=8) # Customize based on # of cores, or leave blank to use all

def dists(x, y):
    matcher = difflib.SequenceMatcher(None, x, y)
    geshs = matcher.ratio()
    levens = Levenshtein.distance(x, y)
    jaros = Levenshtein.jaro_winkler(x, y)
    lcss = matcher.find_longest_match(0, len(x), 0, len(y)).size # longest-common-substring length; I wasn't sure how you'd done this one.
    return [geshs, levens, jaros, lcss]


df = pd.read_csv('MOCK_DATA.csv')
df = df.astype(str) # force all fields to strings.

cols = df.columns
cols = np.array_split(cols, 2) # assumes there's a matching `_2` column for every column.
for x, y in zip(*cols):
    new_cols = [x + '_geshs', x + '_levens', x + '_jaros', x + '_lcss']
    # Assign all four distance columns at once from the expanded apply result.
    df[new_cols] = df.parallel_apply(lambda z: dists(z[x], z[y]), axis=1, result_type='expand')
    # Replace parallel_apply with apply to run non-parallel.

In addition to keeping the original columns, I get the new columns below in about 3 minutes; without parallelization it would probably still only take ~20-30 minutes. Peak memory usage from Python was only about 3 GB, and it would be much lower without parallelization.

first_name_geshs, first_name_levens, first_name_jaros, first_name_lcss,
address_geshs, address_levens, address_jaros, address_lcss,
city_geshs, city_levens, city_jaros, city_lcss,
state_geshs, state_levens, state_jaros, state_lcss,
zip_geshs, zip_levens, zip_jaros, zip_lcss,
url_geshs, url_levens, url_jaros, url_lcss,
phone_geshs, phone_levens, phone_jaros, phone_lcss,
categories_geshs, categories_levens, categories_jaros, categories_lcss
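
As an aside, if you want to skip the per-row apply overhead (and the pandarallel dependency), a plain loop over the paired columns with a list comprehension is usually competitive. This is only a sketch that reuses the dists helper and the cols split from the code above:

frames = []
for x, y in zip(*cols):
    # Compute all four distances for one column pair in a single pass over the values.
    vals = [dists(a, b) for a, b in zip(df[x], df[y])]
    frames.append(pd.DataFrame(
        vals,
        columns=[f"{x}_geshs", f"{x}_levens", f"{x}_jaros", f"{x}_lcss"],
        index=df.index,
    ))
df = df.join(frames)  # attach all new columns by index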