I have two large dictionaries of lists, where all list elements are strings. I want to compare every list in one dictionary against every list in the other and compute their similarity, but the naive way I do it is obviously very slow and does not scale at all:
```python
import numpy as np
import difflib

first_dict = {"first1": ["aa", "bb", "cc", "dd"], "first2": ["ff", "gg"]}
second_dict = {"second1": ["cc", "dd", "jj", "aa", "bb"], "second2": ["ff", "gg"], "second3": ["hh", "ii"]}

mat = np.empty((len(second_dict), len(first_dict)))
for i, second in enumerate(second_dict.keys()):
    for j, first in enumerate(first_dict.keys()):
        sm = difflib.SequenceMatcher(None, sorted(first_dict[first]), sorted(second_dict[second]))
        mat[i, j] = sm.ratio()
print(mat)
```
Is there a clever way to speed this up?
CodePudding user response:
Your code seems legit. I did a couple of little tweaks that would shave off a couple of microseconds per loop:

- No need for the two `sorted` calls, because `difflib` can calculate an order-indifferent comparison with `quick_ratio` (check the documentation for the difference between `ratio`, `quick_ratio`, and `real_quick_ratio`); a short check of this follows the list.
- No need for `enumerate` to access `mat` by `i` and `j`.
- Removed the access of the lists through key lookups (`first_dict[index]` and `second_dict[index]`) by iterating over `.values()` directly.
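For a quick illustration of the first point, here is a small check with toy lists (the printed values are what `difflib` returns for them):

```python
import difflib

a = ["dd", "aa", "bb", "cc"]   # same elements as b, different order
b = ["aa", "bb", "cc", "dd"]

sm = difflib.SequenceMatcher(None, a, b)
print(sm.ratio())        # 0.75 -- ratio() is sensitive to element order
print(sm.quick_ratio())  # 1.0  -- quick_ratio() only counts matching elements, ignoring order
```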
```python
def naive_ratio_comparison(first_dict, second_dict):
    mat = []  # flat list of ratios, reshaped into a matrix below
    for second in second_dict.values():
        for first in first_dict.values():
            sm = difflib.SequenceMatcher(None, first, second)
            mat.append(sm.quick_ratio())
    result = np.resize(mat, (len(second_dict), len(first_dict)))
    return result
```
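For example, with the dictionaries from the question:

```python
result = naive_ratio_comparison(first_dict, second_dict)
print(result.shape)  # (3, 2): one row per second_dict entry, one column per first_dict entry
```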
CodePudding user response:
If one dict has `M` entries and the other `N`, then you're going to have to do `M*N` `.ratio()` calls. There's no way around that, and it's going to be costly.

However, you can easily arrange to do only `M+N` sorts instead of (as shown) `M*N` sorts.
For computing `.ratio()`, the most valuable hint is in the docs:

> `SequenceMatcher` computes and caches detailed information about the second sequence, so if you want to compare one sequence against many sequences, use `set_seq2()` to set the commonly used sequence once and call `set_seq1()` repeatedly, once for each of the other sequences.
Putting that all together:
```python
mat = np.empty((len(second_dict), len(first_dict)))  # as in the question
firsts = list(map(sorted, first_dict.values()))      # sort these only once
sm = difflib.SequenceMatcher(None)
for i, second in enumerate(second_dict.values()):
    sm.set_seq2(sorted(second))        # expensive call: done once per second_dict entry
    for j, first in enumerate(firsts):
        sm.set_seq1(first)             # cheap call
        mat[i, j] = sm.ratio()
```
That should deliver exactly the same results. To minimize the number of expensive `.set_seq2()` calls, it would, of course, be best to arrange for the shorter dict to be the one called `second_dict`.
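If you don't want to hard-code which dict plays which role, a minimal sketch of that last point might look like this (the helper name `ratio_matrix` and the size check are illustrative additions, not part of the answer; note that the row/column orientation of the result follows whichever dict ends up in the `set_seq2()` role):

```python
import difflib
import numpy as np

def ratio_matrix(row_dict, col_dict):
    """mat[i, j] = ratio between the i-th list of row_dict and the j-th list of col_dict.

    row_dict plays the "second_dict" role: each of its lists goes through the
    expensive set_seq2() call exactly once, so it should be the smaller dict.
    """
    cols = [sorted(v) for v in col_dict.values()]  # sort each of these only once
    mat = np.empty((len(row_dict), len(col_dict)))
    sm = difflib.SequenceMatcher(None)
    for i, row in enumerate(row_dict.values()):
        sm.set_seq2(sorted(row))                   # expensive: detailed info is cached
        for j, col in enumerate(cols):
            sm.set_seq1(col)                       # cheap
            mat[i, j] = sm.ratio()
    return mat

# Give the smaller dict the set_seq2() role.
if len(second_dict) <= len(first_dict):
    mat = ratio_matrix(second_dict, first_dict)    # same orientation as in the question
else:
    mat = ratio_matrix(first_dict, second_dict)    # rows/columns swapped relative to the question
```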