I have a program that retrieves headlines from news websites and uses a function to compare the headlines, returning the pairs with a high similarity. My list of headlines looks like this:
[
    {"headline": "Headline Example", "source": "Source Example"},
    {"headline": "Headline Example", "source": "Source Example"},
    {"headline": "Headline Example", "source": "Source Example"},
    ...and so on
]
And I can append each matching pair to a list by doing this:
similarList = []
for i in headlines:
    for x in headlines:
        if similar(i["headline"], x["headline"]) > .6:
            similarList.append([i, x])
And this is the similar function:
# Checks the similarity of two strings
from difflib import SequenceMatcher

def similar(a, b):
    # Returns a value between 0 and 1
    return SequenceMatcher(None, a, b).ratio()
As many of you might point out, this code is long and redundant, but I don't really know any data sorting methods that could get it done quicker. With around 200 headlines it already takes over 10 seconds, which could become a problem. Are there any data sorting algorithms that might get it done quicker?
CodePudding user response:
We would need to see your code for the similar() function to tell whether it would be possible to complete this in O(n), although that would most likely be much more complicated than what you have now.
A simple performance improvement is to remove all the duplicate checks you are performing like this:
for iInd, i in enumerate(headlines):
    for xInd in range(iInd + 1, len(headlines)):
        if similar(i["headline"], headlines[xInd]["headline"]) > .6:
            similarList.append([i, headlines[xInd]])
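For what it's worth, itertools.combinations expresses the same deduplicated pairwise iteration; a minimal sketch, assuming the similar() function and headlines list from the question:

from itertools import combinations

similarList = []
# combinations() yields each unordered pair exactly once, with no self-pairs
for a, b in combinations(headlines, 2):
    if similar(a["headline"], b["headline"]) > .6:
        similarList.append([a, b])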
Edit:
So it seems like you could probably use an m-dimensional hashmap to complete this in O(n log n), or maybe even O(n) with the right implementation, where m scales with the maximum length of the headlines. This would be extremely memory intensive, though.
CodePudding user response:
According to its documentation, SequenceMatcher runs in quadratic expected time (worst-case cubic, best-case linear). This means that the complexity of your original code is O(N^2 * M^2), where N is the size of your array (the number of headlines) and M is the length of the strings (the length of the headlines). The rest of my answer makes sense if M is larger than your alphabet.
I would attempt to reduce the complexity by taking into account that we need the similarity ratio to be > 0.6. This requires the existence of some pretty long common part in the two strings. If you count the frequencies of characters in each string, and a pair of strings has C common characters (counting duplicates), you only care about this pair if 2*C/(M1 + M2) > 0.6. When that happens, you still have to check for actual similarity (which costs M1*M2), but you will most probably skip many expensive checks.
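A minimal sketch of that pre-filtering idea, assuming the headlines list from the question (the find_similar name and the threshold parameter are just for illustration):

from collections import Counter
from difflib import SequenceMatcher

def similar(a, b):
    # Same ratio as in the question: a value between 0 and 1
    return SequenceMatcher(None, a, b).ratio()

def find_similar(headlines, threshold=0.6):
    # Precompute character counts and lengths once per headline
    counts = [Counter(h["headline"]) for h in headlines]
    lengths = [len(h["headline"]) for h in headlines]
    similar_list = []
    for i in range(len(headlines)):
        for j in range(i + 1, len(headlines)):
            # C = number of common characters, counting duplicates
            c = sum((counts[i] & counts[j]).values())
            # 2*C/(M1 + M2) is an upper bound on the ratio; skip if it can't exceed the threshold
            if 2 * c / (lengths[i] + lengths[j]) <= threshold:
                continue
            if similar(headlines[i]["headline"], headlines[j]["headline"]) > threshold:
                similar_list.append([headlines[i], headlines[j]])
    return similar_list

Note that SequenceMatcher itself exposes quick_ratio(), which computes this same character-frequency upper bound on ratio(), so you could also call that as the cheap pre-check before the full ratio().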