How can I make this Indexing algorithm more efficient?


I've got a DataFrame (derived from a CSV file with various columns) with 172033 rows. I've written a custom indexing function that blocks pairs of records whose 'name' attributes are not similar. The problem is the efficiency of the algorithm: just reaching the 10th iteration takes about a minute, so indexing the whole dataset would take far too long. How can I make my algorithm more efficient?

import pandas as pd
from recordlinkage.base import BaseIndexAlgorithm  # recordlinkage's base class for custom index algorithms


class CustomIndex(BaseIndexAlgorithm):
    def _link_index(self, df_a, df_b):
        indici1 = []
        indici2 = []
        for i in range(0, 173033):
            if i % 2 == 0:
                print(i)  # keeps track of the iteration
            for j in range(i, 173033):
                # similar() is defined below; it returns a 0..1 string-similarity ratio
                if similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > 0.5:
                    indici1.append(i)
                    indici2.append(j)

        indici = [indici1, indici2]
        return pd.MultiIndex.from_arrays(indici, names=('first', 'second'))

I want to obtain a MultiIndex object, i.e. an array of tuples containing the indexes of the pairs of records that are similar enough not to be blocked.

MultiIndex([(     0,      0),
            (     0,  22159),
            (     0,  67902),
            (     0,  67903),
            (     1,      1),
            (     1,   1473),
            (     1,   5980),
            (     1, 123347),
            (     2,      2),
            ...

Here's the code for the similarity function:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

Here's an example of the dataframe I have as input:

   name
0  Amazon
1  Walmart
2  Apple
3  Amazon.com
4  Walmart Inc.

I would like the resulting MultiIndex to contain the pairs (0, 3) and (1, 4), plus all the self-pairs ((0, 0), (1, 1), etc.).
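For these example rows, the similarity function above gives roughly the following ratios (approximate values), which is why I expect the (0, 3) and (1, 4) pairs to survive the 0.5 threshold:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

print(similar('Amazon', 'Amazon.com'))      # ~0.75 -> above 0.5, so (0, 3) should link
print(similar('Walmart', 'Walmart Inc.'))   # ~0.74 -> above 0.5, so (1, 4) should link
print(similar('Amazon', 'Apple'))           # ~0.18 -> below 0.5, so (0, 2) is blocked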

CodePudding user response:

We asked you to reveal the similar() function but you've declined to do so.

You wrote

        for i in range(0, 173033):
            ...
            for j in range(i, 173033):
                if similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > 0.5:

So you plan to call that mysterious similar roughly 15 billion times (173033 × 173034 / 2, since your inner loop starts at i). I'm guessing that's going to take some little while, maybe a week or two. We describe this as O(n^2) quadratic performance.

Here is the way out.

First, sort your dataset. It will only cost O(n log n), cheap!

Next, find something in the similar loss function that allows you to simplify the problem, or design a new, related loss function that does. Spoiler alert: we can only afford comparisons against local neighbours, not against far-flung entries more than 100,000 positions away. I will focus on the five example words.

It seems likely that your mysterious function reports a large value for similar("Walmart", "Walmart Inc.") and a small value when comparing either of those strings against "Apple". So let's adopt a new cost-function rule: if the initial characters of the two strings are not identical, the similarity is immediately reported as 0.

So now we only have to compare Apple against the two Amazons, and Walmart against Walmart Inc. We have partitioned the problem into an "A" subset and a "W" subset. The quadratic monster is starting to die.
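A minimal sketch of that first-letter partition, assuming a pandas DataFrame with a 'name' column and the similar() function from the question (the five example names and the 0.5 threshold are just illustrative):

import pandas as pd
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

df = pd.DataFrame({'name': ['Amazon', 'Walmart', 'Apple', 'Amazon.com', 'Walmart Inc.']})

first, second = [], []
# Group row labels by the first character of the name, then compare only within a group.
for _, group in df.groupby(df['name'].str[0]):
    labels = list(group.index)
    for pos, i in enumerate(labels):
        for j in labels[pos:]:  # i <= j: half the square is enough
            if similar(df.loc[i, 'name'], df.loc[j, 'name']) > 0.5:
                first.append(i)
                second.append(j)

index = pd.MultiIndex.from_arrays([first, second], names=('first', 'second'))
print(index)  # contains (0, 0), (0, 3), (1, 1), (1, 4), ... but not (0, 2)

With k roughly equal partitions the pairwise work drops from about n²/2 to about n²/(2k): still quadratic within each partition, but with a much smaller constant.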

Minor side note: since your similarity function is presumably symmetric, it suffices to examine just half the square, where i <= j, which your loop already does by starting j at i.

A practical implementation for a dataset of this size will need to be more aggressive, perhaps insisting that the initial 2 or 3 letters be identical. Perhaps you can run a hyphenation algorithm and look at identical syllables, or use Metaphone or Soundex.
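The same grouping loop works with any blocking key. Here is a hedged variant keyed on the first three letters, lower-cased; block_key is a made-up helper name, and a phonetic code (Soundex, Metaphone) from a suitable library could be returned from it instead. It reuses the df from the sketch above:

def block_key(name):
    # Illustrative blocking key: first three letters, lower-cased.
    # A Soundex/Metaphone code could be returned here instead for looser blocking.
    return name.strip().lower()[:3]

for key, group in df.groupby(df['name'].map(block_key)):
    print(key, list(group.index))  # e.g. 'ama' -> [0, 3], 'app' -> [2], 'wal' -> [1, 4]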

The simplest possible thing you could do is define a window W of maybe 10. When examining the sorted dataset, we arbitrarily declare that the similarity between entries i and j is zero whenever abs(i - j) > W. This may cost some accuracy, but it's great for performance: it lets you prune the search space before you even call the function, taking us from quadratic O(n^2) to O(n·W), which is linear for a fixed W. Perfect accuracy is hardly relevant if you never wait long enough for the code to produce an answer.
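A rough sketch of that windowed idea, dropped into the _link_index structure from the question (WindowedIndex is a made-up name, W = 10 and the 0.5 threshold are just the values discussed above, and similar() is the question's function):

import pandas as pd
from difflib import SequenceMatcher
from recordlinkage.base import BaseIndexAlgorithm  # same base class as the question's CustomIndex

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

class WindowedIndex(BaseIndexAlgorithm):
    window = 10       # W: only compare entries at most this far apart in sorted order
    threshold = 0.5   # same cut-off as in the question

    def _link_index(self, df_a, df_b):
        # Sort the row labels by name so that similar strings end up near each other.
        order = df_a['name'].sort_values().index
        first, second = [], []
        for pos, i in enumerate(order):
            # Only look at the next `window` entries (i itself included, so i <= j).
            for j in order[pos:pos + self.window + 1]:
                if similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > self.threshold:
                    first.append(i)
                    second.append(j)
        return pd.MultiIndex.from_arrays([first, second], names=('first', 'second'))

With the window fixed, the inner loop does at most W + 1 comparisons per row, so the total is about n·(W + 1) similarity calls instead of roughly n²/2.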

Use these ideas to code up a solution, and let us know how you resolved the details.

CodePudding user response:

You are using the .append method of list, which according to the PythonWiki is amortized O(1), but "individual actions may take surprisingly long, depending on the history of the container". You might use collections.deque, which does not have such quirks; just add import collections and do

indici1 = collections.deque()
indici2 = collections.deque()
...
indici = [list(indici1), list(indici2)]

If that does not help enough, we would need to see the similar function to look for further improvements.
