How can I speed up my code instead of using a for loop?

Time:02-23

Here I have two datasets, "Y_N" and "data". "Y_N" has 8 thousand records and "data" has 1.6 million records. In both datasets each record is a string. My task is to match each record of "Y_N" against each record of "data" and calculate a similarity index for every combination.

I did this with a for loop, but it takes far too long (probably a week).

So how can I speed up my code instead of using a for loop? Is there any other way to do this?

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd

# compare every record of Y_N against the first 7000 records of data
sm = [(Y_N['priceGuideDescription'][i],
       data['priceGuideDescription'][j],
       fuzz.ratio(Y_N['priceGuideDescription'][i], data['priceGuideDescription'][j])
      ) for i in range(len(Y_N))
        for j in range(0, 7000)
      ]

df = pd.DataFrame(sm)
df.head()

CodePudding user response:

The limiting factor seems to be the fuzz.ratio() function in the library you use, as it doesn't support vectorization. You could try running the calculation in parallel batches with the multiprocessing module, while reading the larger input file line by line instead of all at once:

from functools import partial
from multiprocessing import Pool
from fuzzywuzzy import fuzz

def calc_ratio(y_n, line):
    # compare a single line from the big file against every record in Y_N
    d = line.rstrip('\n')
    return [(y, d, fuzz.ratio(y, d)) for y in y_n]

pool = Pool()
partial_ratio = partial(calc_ratio, Y_N['priceGuideDescription'].tolist())
with open('data.tsv') as data_handle:
    results = pool.imap(partial_ratio, data_handle, 1000)

You could try adjusting the chunk size (the third parameter of pool.imap()) for better performance. If the results are also too large to fit in memory, you could write them out to separate files.
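As an illustration, here is a minimal sketch of streaming the results to disk while the input file is still open, reusing the pool and partial_ratio from the snippet above. The output file name similarities.csv is my own assumption, not something from the question:

import csv

# hypothetical continuation: keep the input file open while writing results out,
# so neither the 1.6M input lines nor the full result set has to sit in memory
with open('data.tsv') as data_handle, \
     open('similarities.csv', 'w', newline='') as out_handle:
    writer = csv.writer(out_handle)
    for batch in pool.imap(partial_ratio, data_handle, 1000):
        writer.writerows(batch)   # each batch is the list of (y, d, score) triples for one line

pool.close()
pool.join()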

CodePudding user response:

I would recommend using RapidFuzz (I am the author) instead of fuzzywuzzy; it is significantly faster. You can replace your for loop with the following implementation:

import numpy as np
from rapidfuzz import fuzz, process

# compute the full similarity matrix in parallel (workers=-1 uses all CPU cores)
res = process.cdist(Y_N['priceGuideDescription'], data['priceGuideDescription'][0:7000],
                    scorer=fuzz.ratio, dtype=np.uint8, workers=-1)

This will create a matrix of similarities between all elements of the two sequences, similar to scipy.spatial.distance.cdist.
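If you then want labelled output like the DataFrame in the question, something along these lines should work (the idxmax() step is just an example of what you could do with the matrix, not part of the answer above):

import pandas as pd

# hypothetical follow-up: label the similarity matrix with the original strings
df = pd.DataFrame(res,
                  index=Y_N['priceGuideDescription'],
                  columns=data['priceGuideDescription'][0:7000])

# e.g. the best-matching data record for each Y_N record
best_match = df.idxmax(axis=1)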

CodePudding user response:

You have to iterate over the data, so you will still need a loop, but to increase performance you can divide the work and assign the parts to different threads to run simultaneously. Alternatively, if you don't need all the results at once, you can use a generator function to split the load and fetch only the portion you need as it becomes available (see the sketch below). It all depends on your requirements.
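As an illustration of the generator idea, here is a minimal sketch using the fuzzywuzzy names from the question; it does not make the per-pair calculation faster, it only avoids building the full result list up front:

from fuzzywuzzy import fuzz

def similarity_pairs(y_n, records):
    # yield one (y, d, score) triple at a time instead of building the full list
    for y in y_n:
        for d in records:
            yield y, d, fuzz.ratio(y, d)

pairs = similarity_pairs(Y_N['priceGuideDescription'], data['priceGuideDescription'][0:7000])
first_ten = [next(pairs) for _ in range(10)]   # fetch only as much as you need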

CodePudding user response:

You can try using NumPy arrays, which are built on top of C arrays and are faster than regular Python lists.
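One way to apply that to the loop from the question is a minimal sketch like the following: pull the columns out of pandas once before looping, so you avoid repeated pandas lookups inside the loop. Note that the per-pair fuzz.ratio() call remains the dominant cost:

from fuzzywuzzy import fuzz

# extract the string columns as plain arrays once, before looping
y_arr = Y_N['priceGuideDescription'].to_numpy()
d_arr = data['priceGuideDescription'].to_numpy()[:7000]

sm = [(y, d, fuzz.ratio(y, d)) for y in y_arr for d in d_arr]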
