Fastest way to identify all pairs of lists that their difference is lower than a given threshold whe-CodePudding

ahi, everyone. sorry to bother you.

I have this task that I have a list of hash codings stored in a list with 30 positions with value 0 and 1. In total, I have over 10000 such 30 size (0/1) hash codes and I would like to find all pairs of such hash codes which have the difference lower than a given threshold (say 0, 1, 5), in which case this pair would be considered as "similar" hash codings.

I have realised this using double "for loop" in python3 (see code below), but I do not feel it is efficient enough, as this seems to be a O(N!), and it is indeed slow when N = 10000 or even larger.

My question would be is there better way we could speed this finding similar hash pairs up ? Ideally, in O(N) I suppose ?

Note by efficiency I mean finding similar pairs given thershold rather than generating hash codings (this is only for demonstration).

I have digged in this problem a little bit, all the answers I have found is talking about using some sort of collection tools to find identical pairs, but here I have a more general case that the pairs could also be similiar given a threshold.

I have provided the code that generates sample hashing codings and the current low efficient program I am using. I hope you may find this problem interesting and hopefully some better/smarter/senior programmer could lend me a hand on this one. Thanks in advance.

import random
import numpy as np

# HashCodingSize = 10
# Just use this to test the program
HashCodingSize = 100
# HashCodingSize = 1000
# What can we do when we have the list over 10000, 100000 size ? 
# This is where the problem is 
# HashCodingSize = 10000
# HashCodingSize = 100000

#Generating "HashCodingSize" of list with each element has size of 30
outputCodingAllPy = []
for seed in range(HashCodingSize):
    random.seed(seed)
    listLength = 30
    numZero = random.randint(1, listLength)
    numOne = listLength - numZero
    my_list = [0] * numZero   [1] * numOne
    random.shuffle(my_list)
    # print(my_list)
    outputCodingAllPy.append(my_list)

#Covert to np array which is better than python3 list I suppose?
outputCodingAll = np.asarray(outputCodingAllPy)
print(outputCodingAll)
print("The N is", len(outputCodingAll))

hashDiffThreshold = 0
#hashDiffThreshold = 1
#hashDiffThreshold = 5
loopRange = range(outputCodingAll.shape[0])
samePairList = []

#This is O(n!) I suppose, is there better way ? 
for i in loopRange:
    for j in loopRange:
        if j > i:
            if (sum(abs(outputCodingAll[i,] - outputCodingAll[j,])) <= hashDiffThreshold):
                print("The pair (",  str(i), ", ", str(j), ") ")
                samePairList.append([i, j])

print("Following pairs are considered the same given the threshold ", hashDiffThreshold)
print(samePairList)

CodePudding user response：

If you only need 30-bit vectors, it would be much better to represent then as 30 bits in a 32-bit integer. Then the Hamming distance between two "vectors" is just the number of bits in the xor of the two integers. There are efficient algorithms for computing the number of non-zero bits in an integer. Those can be readily vectorized using numpy.

So the algorithm is:

generate HashCodingSize random integers between 0 and (1<<30)-1. That's one line with numpy.random.randint()
for each value xor it with the array (see numpy.bitwise_xor), compute the number of bits in each xor output value (vectorize one of the bit count algorithms), and find the indices whose bit count is less than or equal to hashDiffThreshold

This is still O(n^2), but is just a single loop in python; each operation in the loop operates on a length-n vector with numpy calls.

CodePudding user response：

As long as your listLength is within the size of an integer on your computer, I would use integers instead. Then you can xor the values (using broadcasting to xor all values against each other at once) to get the number of bits that are different, sum those bits and then use nonzero to find indexes that fit the requirement hash difference requirement. For example:

import numpy as np
import random

HashCodingSize = 10
listLength = 30
outputCodingAll = np.array([random.choice(range(2**listLength)) for _ in range(HashCodingSize)])
# sample result
# array([995834408, 173548139, 717311089,  87822983, 813938401, 
#        363814224, 970707528, 907497995, 337492435, 361696322])

distance = bit_count(outputCodingAll[:, np.newaxis] ^ outputCodingAll)
# sample result
# array([[ 0, 10, 15, 18, 14, 18,  8, 12, 18, 16],
#        [10,  0, 13, 14, 16, 24, 14, 14, 16, 18],
#        [15, 13,  0, 23, 13, 15, 15, 17, 19, 15],
#        [18, 14, 23,  0, 18, 16, 18, 12, 12, 14],
#        [14, 16, 13, 18,  0, 16, 12, 14, 14, 14],
#        [18, 24, 15, 16, 16,  0, 14, 16, 12,  6],
#        [ 8, 14, 15, 18, 12, 14,  0, 12, 18, 14],
#        [12, 14, 17, 12, 14, 16, 12,  0, 14, 14],
#        [18, 16, 19, 12, 14, 12, 18, 14,  0, 12],
#        [16, 18, 15, 14, 14,  6, 14, 14, 12,  0]], dtype=int32)

hashDiffThreshold = 10
samePairList = np.transpose(np.nonzero(distance < hashDiffThreshold))
# sample result
# array([[0, 0],
#        [0, 6],
#        [1, 1],
#        [2, 2],
#        [3, 3],
#        [4, 4],
#        [5, 5],
#        [5, 9],
#        [6, 0],
#        [6, 6],
#        [7, 7],
#        [8, 8],
#        [9, 5],
#        [9, 9]], dtype=int64)

Note the result repeats pairs (e.g. [5, 9] and [9, 5]) as they are all tested as both the first and second operand). It also includes each value tested against itself (which is obviously 0). These results can be easily filtered out if desired.

Note if you want to convert any of the values to lists of 1 and 0 you can format the numbers as binary strings of length listLength and map each character to an int e.g.

list(map(int, f'{outputCodingAll[0]:0{listLength}b}'))
# sample output
# [0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1]

This code uses the bit_count function from this answer:

def bit_count(arr):
    # Make the values type-agnostic (as long as it's integers)
    t = arr.dtype.type
    mask = t(-1)
    s55 = t(0x5555555555555555 & mask)  # Add more digits for 128bit support
    s33 = t(0x3333333333333333 & mask)
    s0F = t(0x0F0F0F0F0F0F0F0F & mask)
    s01 = t(0x0101010101010101 & mask)
    
    arr = arr - ((arr >> 1) & s55)
    arr = (arr & s33)   ((arr >> 2) & s33)
    arr = (arr   (arr >> 4)) & s0F
    return (arr * s01) >> (8 * (arr.itemsize - 1))

CodePudding user response：

This variant produces the same result as your code

outer_not_equal = np.not_equal.outer(outputCodingAll, outputCodingAll)

diff_count_matrix = outer_not_equal.sum((1, 3)) // outputCodingAll.shape[1]

samePairNumpy = np.transpose(np.where(diff_count_matrix <= hashDiffThreshold))

# filtering out diagonal values
samePairNumpy = samePairNumpy[samePairNumpy[:, 0] != samePairNumpy[:, 1]]

# filtering out duplicates above diagonal
samePairNumpy.sort(axis=1)
samePairNumpy = np.unique(samePairNumpy, axis=0)

array([[22, 23],
       [22, 54],
       [22, 62],
       [22, 70],
       [22, 93],
       [23, 54],
       [23, 62],
       [23, 70],
       [23, 93],
       [54, 62],
       [54, 70],
       [54, 93],
       [62, 70],
       [62, 93],
       [70, 93]], dtype=int64)

the same result as your code produces

#Generating "HashCodingSize" of list with each element has size of 30
outputCodingAllPy = []
for seed in range(HashCodingSize):
    random.seed(seed)
    listLength = 30
    numZero = random.randint(1, listLength)
    numOne = listLength - numZero
    my_list = [0] * numZero   [1] * numOne
    random.shuffle(my_list)
    # print(my_list)
    outputCodingAllPy.append(my_list)

#Covert to np array which is better than python3 list I suppose?
outputCodingAll = np.asarray(outputCodingAllPy)
print(outputCodingAll)
print("The N is", len(outputCodingAll))

hashDiffThreshold = 0
#hashDiffThreshold = 1
#hashDiffThreshold = 5
loopRange = range(outputCodingAll.shape[0])
samePairList = []

#This is O(n!) I suppose, is there better way ? 
for i in loopRange:
    for j in loopRange:
        if j > i:
            if (sum(abs(outputCodingAll[i,] - outputCodingAll[j,])) <= hashDiffThreshold):
                print("The pair (",  str(i), ", ", str(j), ") ")
                samePairList.append([i, j])

print("Following pairs are considered the same given the threshold ", hashDiffThreshold)
print(samePairList)

[[22, 23], [22, 54], [22, 62], [22, 70], [22, 93], [23, 54], [23, 62], [23, 70], [23, 93], [54, 62], [54, 70], [54, 93], [62, 70], [62, 93], [70, 93]]