Here is what I have:
long_list = a very long list of integer values (6M entries)
wanted_list = a list of integer values that are of interest (70K entries)
What I need:
mask_list = a list of booleans of the same length as long_list, describing whether each element of long_list in present in the wanted_list (i.e. [is long_list[0] in wanted_list?, is long_list[1] in wanted_list?,....]). The number of'True' entries in this list should be the same as len(wanted_list)
I got a working code using a for loop, but as expected it's way too slow for the length of the lists that I am working with (takes several minutes to run):
masklist = []
for element in long_list:
if element in wanted_list:
masklist.append(True)
else:
masklist.append(False)
I was wondering if there is a more elegant and fast way to achieve this goal? I was looking into numpy.ma module, but could not think of an elegant way to apply it to this problem
CodePudding user response:
You can use numpy.isin
for this:
masklist = np.isin(long_list, wanted_list)
CodePudding user response:
This is slow because the algorithm has a poor complexity (O(n²)
). This can be done in O(n)
:
wanted_set = set(wanted_list)
masklist = [element in wanted_set for element in long_list]
If you operate on Numpy array, then you can use np.unique
and np.searchsorted
but it will be computed in O(n log n)
time. It may be faster for small arrays (due to the Numpy implementation), but not for huge ones (due to the complexity and inefficient cache accesses). np.isin
may also be useful and maybe faster.