So I have a ginormous list of 466,550 words, and I need to compare each word with every other word to check the similarity between them. Similarity between two words is defined, in my case, as the number of common characters between them. But if I try:
sim = []
for i in words:
    temp = []
    for j in words:
        if i != j:
            temp.append(similar(i, j))
    sim.append(max(temp))
It runs 466,550^2 times and takes an eternity to produce the final output. So is there another way to do the same thing in linear time?
Edit: Similar has a pretty basic definition as follows:
def similar(a, b):
    x = list(set(a) & set(b))
    return len(x)
I looked up a way to compare words and just returned the total number of common characters for each comparison.
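For example, "hello" and "world" have only the characters 'l' and 'o' in common, so:

>>> similar("hello", "world")
2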
Edit 2:
So here's the whole question for those interested:
John and Billy are two good friends. John wanted to hack the password of Billy's Insta account just for fun. The password is an English word from this dictionary:
https://raw.githubusercontent.com/dwyl/english-words/master/words.txt.
He cannot try out all the words from the dictionary, as each try takes at least 30 seconds. But he knows the following clues about the password:
- Clue 1: The word has the highest similarity with other words in the dictionary. Similarity between two words is defined by the number of common characters.
- Clue 2: The word has a large number of vowels in it.
- Clue 3: The word has a large number of characters in it.

Create 3 independent word lists ranking the words by each clue. Finally, rank each word based on its position in these 3 lists (with a weightage of 0.33 for each clue). Come up with the top 100 potential words based on the final rank.
CodePudding user response:
I think an important part of this problem is how you interpret the clause "highest similarity with other words in dictionary" in the problem description. If you take it to mean the highest sum of similarities with all the other words, you can compute that sum far more efficiently than with your nested loops.
Rather than directly comparing every pair of words, you can count how many different words contain each letter. Then, in a second pass, you can add up the number of words that share each of a word's letters, which is almost the sum of similarities we want. We just need to subtract out the word's similarity with itself (the number of unique letters it contains), since we're only supposed to compare it to other words.
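As a tiny check of the idea: for the words ["cat", "act", "dog"], each of the letters c, a, t appears in two words and each of d, o, g in one. For "cat", the counts for its letters sum to 2 + 2 + 2 = 6; subtracting its 3 unique letters leaves 3, which matches similar("cat", "act") + similar("cat", "dog") = 3 + 0.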
Here's a function that computes the list of similarity sums, in the same order the words appear; it makes just two linear passes over the word list:
from collections import Counter

def compute_similarities(words):
    letter_count = Counter()
    for word in words:  # first pass
        letter_count.update(set(word))  # count each letter once per word
    similarities = []
    for word in words:  # second pass
        letters = set(word)
        similarities.append(sum(letter_count[letter] for letter in letters)
                            - len(letters))  # correct for self-similarity
    return similarities
Here's an example run, with the names of the months as our dictionary:
>>> months = [
...     "january", "february", "march", "april",
...     "may", "june", "july", "august",
...     "september", "october", "november", "december",
... ]
>>> compute_similarities(months)
[23, 28, 18, 14, 12, 13, 10, 12, 24, 21, 23, 22]
So it looks like February is the most similar to the other months. To validate that this counts correctly (given my assumption about what "similarity with other words" means), we can compare its results against a version based on your similarity function and nested loops, which works just fine for short word lists:
def similar(a, b):
    x = list(set(a) & set(b))
    return len(x)

def compute_similarities_bruteforce(words):
    similarities = []
    for i in words:
        total = 0
        for j in words:
            if i != j:
                total += similar(i, j)  # accumulate similarity with every other word
        similarities.append(total)
    return similarities
Test run:
>>> compute_similarities_bruteforce(months)
[23, 28, 18, 14, 12, 13, 10, 12, 24, 21, 23, 22]
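The answer doesn't show the download and cleanup step, so here's a minimal sketch of one way to do it (the lowercasing matches what's described below; reading the file over the network rather than from disk is my own choice):

import urllib.request

# Fetch the dictionary from the URL given in the question.
url = "https://raw.githubusercontent.com/dwyl/english-words/master/words.txt"
with urllib.request.urlopen(url) as response:
    text = response.read().decode("utf-8")

# Light pre-processing: lowercase everything and drop blank lines.
words = [line.strip().lower() for line in text.splitlines() if line.strip()]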
After downloading the word list you're supposed to use (and very lightly pre-processing it, e.g. making everything lowercase), I found these to be the most similar words in the list:
>>> similarities = compute_similarities(words)
>>> similarity_sorted_words = sorted(zip(similarities, words), reverse=True)
>>> similarity_sorted_words[:20]
[(3059877, 'pseudolamellibranchiate'),
(3059877, 'pseudolamellibranchiata'),
(2973121, 'pseudoconglomeration'),
(2966493, 'pseudoanthropological'),
(2956212, 'pseudoromantically'),
(2949584, 'spondylotherapeutics'),
(2949584, 'pseudophilanthropically'),
(2949584, 'pseudohallucinatory'),
(2946269, 'neuropharmacologist'),
(2932039, 'salpingo-oophorectomy'),
(2929360, 'hyperconstitutionalism'),
(2925092, 'encephalomyocarditis'),
(2887146, 'sphygmomanometrically'),
(2887146, 'semianthropologically'),
(2887146, 'oophorosalpingectomy'),
(2884111, 'pseudoconservatively'),
(2880336, 'pneumatico-hydraulic'),
(2875526, 'quasi-complimentary'),
(2874192, 'cloth-measuring'),
(2873602, 'cloth-spreading')]
The ties are generally sets of words that contain exactly the same letters (which isn't always obvious, since the words often have many repeated letters). You might get slightly different results if you preprocess the word list differently (e.g. keeping case sensitivity, or filtering out non-letter characters like hyphens and apostrophes).
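If you do want to filter out non-letter characters before counting, one possible variant (my own sketch, not something the runs above used) is to build each word's character set from alphabetic characters only:

from collections import Counter

def compute_similarities_letters_only(words):
    # Same two-pass idea as compute_similarities, but hyphens,
    # apostrophes, etc. are excluded from each word's character set.
    letter_sets = [{c for c in word if c.isalpha()} for word in words]
    letter_count = Counter()
    for letters in letter_sets:  # first pass: count words containing each letter
        letter_count.update(letters)
    return [sum(letter_count[c] for c in letters) - len(letters)
            for letters in letter_sets]  # second pass, minus self-similarity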