I am working on a problem where I need to compare one particular array to hundreds of thousands of others and return a list of results showing how similar each one is to it. I read that numpy is probably the best library for working with arrays (if there's anything better, please let me know :)), so I wrote the code below, but it's still slow. I am not the best at programming, so any help improving this would be immensely appreciated!
import numpy as np
list_of_arrays = [np.random.randint(0, 2, (30, 30)) for array in range(100000)]
base_array = np.random.randint(0, 2, (30, 30))
results = []
for array in list_of_arrays:
    results.append(np.sum(np.equal(base_array, array)))
CodePudding user response:
You can use numpy broadcasting to do it in one line, without a list comprehension or loops of any kind:
results = np.equal(base_array, list_of_arrays).sum(axis=1).sum(axis=1)
You have so many arrays that it can't get much faster ;)
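If you want to convince yourself that the broadcast version gives the same counts as the original loop, here is a minimal sketch. It assumes a smaller batch of 1000 arrays (not the 100000 from the question) and a seeded generator, purely so the check runs quickly and reproducibly; the names loop_results and broadcast_results are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the check is reproducible

# Smaller than the question's 100000 so the check runs quickly (assumption)
list_of_arrays = [rng.integers(0, 2, (30, 30)) for _ in range(1000)]
base_array = rng.integers(0, 2, (30, 30))

# Original approach: one np.sum call per array
loop_results = [np.sum(np.equal(base_array, array)) for array in list_of_arrays]

# Broadcasting: (30, 30) compared against (1000, 30, 30) gives a
# (1000, 30, 30) boolean array, which is then summed over the last two axes
broadcast_results = np.equal(base_array, list_of_arrays).sum(axis=1).sum(axis=1)

assert broadcast_results.shape == (1000,)
assert np.array_equal(loop_results, broadcast_results)
```

Note that np.equal has to convert the Python list into one big array first, so part of the run time goes into that conversion.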
CodePudding user response:
There are a number of efficient tricks for doing this in numpy. None of them require explicit loops or appending to a list.
First, make the list into an array:
list_of_arrays = np.random.randint(0, 2, (100000, 30, 30), dtype=bool)
Notice how much simpler (and faster) that is. Now make a boolean base:
base_array = np.random.randint(0, 2, (30, 30), dtype=bool)
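If your real data already exists as a Python list of arrays rather than being generated randomly (an assumption about your use case), one way to get the same layout is np.stack followed by a cast to bool. This is only a sketch: existing_list, stacked, and base are hypothetical names, and the cast only preserves the meaning of the comparison when the values are strictly 0 and 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for data that already exists as a list of 30x30 integer arrays
existing_list = [rng.integers(0, 2, (30, 30)) for _ in range(1000)]

# np.stack joins the arrays along a new first axis: shape (1000, 30, 30)
stacked = np.stack(existing_list).astype(bool)

# The base array gets the same bool cast so both sides have matching dtypes
base = rng.integers(0, 2, (30, 30)).astype(bool)

assert stacked.shape == (1000, 30, 30)
assert stacked.dtype == bool
```

Doing the conversion once up front means the comparison itself no longer pays for turning the list into an array.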
The simplest comparison makes direct use of broadcasting:
results = (base_array == list_of_arrays).sum((1, 2))
The equality of two booleans can also be obtained from their XOR:
results = (~base_array ^ list_of_arrays).sum((1, 2))
Running ~ on base_array is much faster than doing it on list_of_arrays or on the result of the XOR, and has the same logical effect.
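Here is a small sketch checking that claim on random data: inverting the base first, inverting the XOR result afterwards, and plain equality should all give identical counts. The names stack and base and the batch size of 1000 are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
stack = rng.integers(0, 2, (1000, 30, 30)).astype(bool)
base = rng.integers(0, 2, (30, 30)).astype(bool)

# Plain equality, summed over the last two axes
by_equality = (base == stack).sum((1, 2))

# ~ binds tighter than ^, so this is (~base) ^ stack: only 900 values are inverted
by_early_not = (~base ^ stack).sum((1, 2))

# Same logic, but the inversion touches all 1000 * 900 values of the XOR result
by_late_not = (~(base ^ stack)).sum((1, 2))

assert np.array_equal(by_equality, by_early_not)
assert np.array_equal(by_equality, by_late_not)
```

The early inversion works because NOT(a) XOR b equals NOT(a XOR b) for booleans, which is exactly a == b.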
You can simplify the sum by raveling the last dimensions:
results = (base_array.ravel() == list_of_arrays.reshape(100000, -1)).sum(-1)
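To see which variant is actually fastest on your machine, you can time them yourself. The sketch below uses the standard timeit module and assumes a reduced batch size n = 2000 so that it finishes quickly; the variants and timings dictionaries are illustrative names, and the printed numbers will differ from system to system.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # hypothetical size; raise toward 100000 for a realistic measurement
list_of_arrays = rng.integers(0, 2, (n, 30, 30)).astype(bool)
base_array = rng.integers(0, 2, (30, 30)).astype(bool)

variants = {
    "equality": lambda: (base_array == list_of_arrays).sum((1, 2)),
    "xor": lambda: (~base_array ^ list_of_arrays).sum((1, 2)),
    "raveled": lambda: (base_array.ravel() == list_of_arrays.reshape(n, -1)).sum(-1),
}

# All variants should produce the same counts before we compare speed
reference = variants["equality"]()
for name, fn in variants.items():
    assert np.array_equal(fn(), reference), name

# Best of three repeats, reported as seconds per call
timings = {
    name: min(timeit.repeat(fn, number=5, repeat=3)) / 5
    for name, fn in variants.items()
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms")
```

Taking the minimum over repeats is the usual way to reduce noise from other processes.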