Home > other >  Most efficient way to cross compare millions of hash values in a list
Most efficient way to cross compare millions of hash values in a list

Time:03-19

I have a list of 9 million hash values. I need to compare each value(hash0) in the list with the rest of the values :

for i, hash0 in enumerate(hashes_list):
    for hash1 in hashes_list[i:]:
       if hash0 -hash1 < threshold:
          #do something

This solution above is of quadratic complexity, which is taking forever to run (even in a server). What is an efficient way to cross-match these 9 million hashes?

Here is a sample of the hashes_list values :

8c59ac5169e673a6
ab9f545497b05683 
9590ee98373e1e19 
c1274a5e1e150e7f
938f7c782dc6241b

CodePudding user response:

Since you don't want to compare exact matches of your values, that rules out easily using sets or dicts -

but you can certainly benefit from using a better data structure that could be more fit for the purpose.

If the value comparison you need is numeric, as it seems in your code, it looks like simply sorting the list (and sorting 9 million values is quite feasible), and comparing the neighbors in the result could suffice to reduce your complexity from O(n**2) to O(n).

CodePudding user response:

Assuming the subtraction is just a regular subtraction, try sorting first, Sorts can be O(n Ln(n)) time complexity which is a little better than n^2

That way you could iterate once with two pointers finding groups of hashes that are all close to each other. This would be n*k complexity with n being the size of the hashes and k being the average number that match.

The pseudo code would look something like

sort(hashes_list) #large to small
count = size(hashes_list)
i = 0
while i < count:
     j = i   1
     while hashes_list[i] - hashes_list[j] < threshold:
         #do something
         j  = 1
     i  = 1

you might be able to skip the check in some cases. For example where 0 - 10 all are within the threshold, then 1-10 would also be and the "#do something" would just have to be called for each without another check

  • Related