Faster way of comparing all data in list


I have a program that retrieves headlines from news websites, and using a function gets the similarity of these headlines and returns ones with a high similarity. My list of headlines looks like this:

[
{"headline": "Headline Example", source: "Source Example"},
{"headline": "Headline Example", source: "Source Example"},
{"headline": "Headline Example", source: "Source Example"},
...And so on
]

And I can append each pair of similar headlines to a list by doing this:

similarList = []

for i in headlines:
    for x in headlines:
        if similar(i["headline"], x["headline"]) > .6:
            similarList.append([i, x])

And this is the similar function:


# Checks similarity of 2 strings
from difflib import SequenceMatcher

def similar(a, b):
    # Returns a value between 0 and 1
    return SequenceMatcher(None, a, b).ratio()
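
For illustration, here is roughly what similar() returns (the example strings are made up; exact ratios depend on the inputs):

print(similar("Markets rally after rate cut", "Markets rally after rates cut"))  # close to 1.0
print(similar("Markets rally after rate cut", "Local team wins championship"))   # much lower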

As many of you might point out, this code is redundant, but I don't know of any data-sorting method that would get it done more quickly. With around 200 headlines it already takes over 10 seconds, which could become a problem. Are there any data sorting algorithms that might get it done quicker?

CodePudding user response:

We would need to see the code for your similar() function to know whether this could be done in O(n), although that would most likely be much more complicated than what you have now.

A simple performance improvement is to remove all the duplicate checks you are performing like this:

for iInd, i in enumerate(headlines):
    for xInd in range(iInd + 1, len(headlines)):
        x = headlines[xInd]
        if similar(i["headline"], x["headline"]) > .6:
            similarList.append([i, x])
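
The same deduplication can also be written with itertools.combinations, which yields each unordered pair of headlines exactly once (a sketch equivalent to the loop above):

from itertools import combinations

similarList = []
# Each unordered pair is produced once, so no headline is compared
# with itself or checked twice in the opposite order.
for i, x in combinations(headlines, 2):
    if similar(i["headline"], x["headline"]) > .6:
        similarList.append([i, x])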

Edit:

So it seems like you could probably use an m-dimensional hashmap to complete this in O(n log n), or maybe even O(n) with the right implementation, where m scales with the maximum headline length. This would be extremely memory intensive, though.

CodePudding user response:

According to its documentation, SequenceMatcher runs in quadratic expected time (worst-case cubic, best-case linear). This means the complexity of your original code is O(N^2 * M^2), where N is the size of your array (the number of headlines) and M is the length of the strings (the length of the headlines). The rest of my answer assumes M is larger than the size of your alphabet.

I would attempt to reduce the complexity by taking into account that we need the similarity ratio to be > 0.6. This requires a fairly long common part in the two strings. If you count the frequencies of characters in each string, and a pair of strings has C common characters (counting duplicates), you only care about this pair if 2*C/(M1 + M2) > 0.6. When that happens, you still have to check for similarity (which costs O(M1*M2)), but you will most probably skip many of the expensive checks.
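
Here is a sketch of that pre-filter with collections.Counter, reusing the headlines list and the 0.6 threshold from the question; 2*C/(M1 + M2) is an upper bound on ratio(), and difflib's quick_ratio() computes essentially the same bound:

from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations

def common_chars(a, b):
    # Number of characters the two strings share, counting duplicates
    # (the size of the multiset intersection of their characters).
    return sum((Counter(a) & Counter(b)).values())

similarList = []
for i, x in combinations(headlines, 2):
    a, b = i["headline"], x["headline"]
    # Cheap O(M1 + M2) upper bound on the ratio; if even this bound
    # does not exceed 0.6, the expensive check cannot pass either.
    if 2 * common_chars(a, b) / (len(a) + len(b)) <= 0.6:
        continue
    if SequenceMatcher(None, a, b).ratio() > 0.6:
        similarList.append([i, x])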
