Comparing 2D boolean arrays

I am working on a problem where I need to compare one particular array to hundreds of thousands of others and return a list of results showing how similar each one is to it. I read that numpy is probably the best library for working with arrays (if there's anything better, please let me know :) so I wrote the code below, but it's still slow. I am not the best at programming, so any help improving it would be immensely appreciated!

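import numpy as np

# 100,000 random 30x30 arrays of 0s and 1s, plus one array to compare against
list_of_arrays = [np.random.randint(0, 2, (30, 30)) for _ in range(100000)]
base_array = np.random.randint(0, 2, (30, 30))
results = []

# Count the matching elements between base_array and each array in turn
for array in list_of_arrays:
    results.append(np.sum(np.equal(base_array, array)))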

CodePudding user response:

You can use numpy broadcasting to do it in one line, without a list comprehension or loops of any kind:

results = np.equal(base_array, list_of_arrays).sum(axis=1).sum(axis=1)

You have so many arrays that it can't get much faster ;)
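
As a quick sanity check, here is a minimal sketch showing that the broadcast version gives the same counts as the original loop. The smaller batch (1,000 arrays), the seeded generator, and the names `loop_results` and `vec_results` are my own choices for illustration, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
arrays = [rng.integers(0, 2, (30, 30)) for _ in range(1000)]
base = rng.integers(0, 2, (30, 30))

# Original approach: one np.equal + np.sum call per array.
loop_results = [int(np.sum(np.equal(base, a))) for a in arrays]

# Broadcast approach: stack into shape (1000, 30, 30); base, with shape
# (30, 30), is compared against every 30x30 slice in a single call.
stacked = np.stack(arrays)
vec_results = np.equal(base, stacked).sum(axis=(1, 2))

print(vec_results.shape)                     # (1000,)
print(vec_results.tolist() == loop_results)  # True
```

Passing the plain list to `np.equal` also works, since numpy converts it to an array first; stacking explicitly just makes that step visible.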

CodePudding user response:

There are a number of efficient tricks for doing this in numpy. None of them require explicit loops or appending to a list.

First, make the list into an array:

list_of_arrays = np.random.randint(0, 2, (100000, 30, 30), dtype=bool)

Notice how much simpler (and faster) that is. Now make a boolean base:

base_array = np.random.randint(0, 2, (30, 30), dtype=bool)
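
If the data already exists as a Python list, as in the question, it doesn't have to be regenerated. A minimal sketch, assuming every array in the list has the same shape and holds only 0s and 1s (the name `stacked` and the smaller batch size are just for illustration):

```python
import numpy as np

# Stand-in for the question's data, with a smaller batch for illustration.
list_of_arrays = [np.random.randint(0, 2, (30, 30)) for _ in range(1000)]

# Convert the list of 2D integer arrays into one 3D boolean array.
stacked = np.asarray(list_of_arrays, dtype=bool)

print(stacked.shape)     # (1000, 30, 30)
print(stacked.dtype)     # bool
print(stacked.itemsize)  # 1 (byte per element)
```

A boolean array stores one byte per element, which is less than numpy's default integer type (4 or 8 bytes, depending on the platform), so the comparison has less memory to read.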

The simplest comparison makes direct use of broadcasting:

results = (base_array == list_of_arrays).sum((1, 2))

The equality of two booleans can also be obtained as the negation of their XOR (that is, XNOR):

results = (~base_array ^ list_of_arrays).sum((1, 2))

Applying ~ to base_array (900 elements) is much cheaper than applying it to list_of_arrays or to the result of the XOR (90 million elements each), and the logical result is the same.
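
To see why the XOR form counts matches, here is a small truth-table sketch over the four possible input pairs. The arrays `a`, `b`, and `ints` are hypothetical and only there for illustration.

```python
import numpy as np

# All four combinations of two boolean inputs.
a = np.array([False, False, True, True])
b = np.array([False, True, False, True])

equal = a == b
xnor = ~a ^ b  # unary ~ binds tighter than ^, so this is (~a) ^ b

print(equal.tolist())  # [True, False, False, True]
print(xnor.tolist())   # [True, False, False, True]

# Caveat: this relies on the bool dtype. On integer arrays ~ is a bitwise
# NOT, so 0 and 1 become -1 and -2 instead of 1 and 0.
ints = np.array([0, 1])
print((~ints).tolist())  # [-1, -2]
```

So the XOR variant should only be used once both arrays really are boolean, as they are after the `dtype=bool` step above.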

You can simplify the sum by raveling the last dimensions:

results = (base_array.ravel() == list_of_arrays.reshape(100000, -1)).sum(-1)
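
Putting the variants side by side, the following sketch checks that all three produce the same counts, and turns the counts into a fraction of matching cells. The smaller batch, the variable names, and the `similarity` step are my own additions for illustration; using `len(stack)` rather than a hard-coded 100000 is also just a convenience.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
stack = rng.integers(0, 2, (1000, 30, 30)).astype(bool)
base = rng.integers(0, 2, (30, 30)).astype(bool)

by_eq = (base == stack).sum((1, 2))    # direct broadcasting
by_xor = (~base ^ stack).sum((1, 2))   # XNOR via a negated base
by_flat = (base.ravel() == stack.reshape(len(stack), -1)).sum(-1)  # flattened

print(np.array_equal(by_eq, by_xor))   # True
print(np.array_equal(by_eq, by_flat))  # True

# Optional: express each count as a fraction of the 900 cells.
similarity = by_eq / base.size
print(similarity.shape)                # (1000,)
```

Which variant is fastest will depend on the machine and numpy version, so it is worth timing them on the real data, for example with `timeit`.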