Iterate over `True` entries of Boolean Numpy array

Time:12-25

I want a loop for each index i at which array X (which is Boolean) is True.

Is there something more efficient/pythonic than wrapping np.nonzero inside np.nditer as follows?

for i in np.nditer(np.nonzero(X), flags=['zerosize_ok']):
    myfunction(Y[i],Z2[Z[i]])

The problem here is that it iterates twice instead of just once, and occupies memory (first, np.nonzero iterates through X and stores that to a big array, then np.nditer iterates through that array).

Is there a command (slightly similar to np.nditer, so to speak) for efficiently iterating over True entries of a Boolean array directly, without listing them all explicitly with np.nonzero first? (Iterating over all entries and checking each with an if statement is probably less efficient than some iterator offered by Numpy, if it exists.)

CodePudding user response:

People downvote because looping over the entries of a NumPy array is a big no-no. We use NumPy because it is fast; handling each element individually instead of operating on the whole array at once gives you Python-level performance rather than NumPy/C performance.

Excluding values by means of an array of True and False values is very common and is called masking. In NumPy you do it by indexing an array with the Boolean array. E.g. np.array([1,2,3])[np.array([True,False,True])] gives you np.array([1, 3]).
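A minimal sketch of that masking, using the same small arrays as the inline example. The names `values` and `mask` are illustrative only and do not come from the question.

```python
import numpy as np

values = np.array([1, 2, 3])
mask = np.array([True, False, True])

# Boolean indexing keeps only the entries where mask is True
selected = values[mask]
print(selected)  # [1 3]

# The mask can also come from a comparison on the array itself
evens = values[values % 2 == 0]
print(evens)  # [2]
```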

So basically try arranging things in a way that you can do

myfunction(Y[mask], Z2[Z[mask]]).

There are a couple of techniques for doing that. One is to build myfunction using only NumPy functions; another is to use decorators such as numba.vectorize, numba.guvectorize or numba.njit, among others.
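As a rough sketch of the first technique: the question does not say what myfunction does, so the version below assumes it is simple elementwise arithmetic (a product, chosen purely as a stand-in). The array contents are made up for illustration. Under that assumption the loop collapses to a single call on masked arrays.

```python
import numpy as np

# Illustrative data; the question does not give actual arrays
X = np.array([True, False, True, True])   # Boolean mask
Y = np.array([10, 20, 30, 40])
Z = np.array([3, 2, 1, 0])                # indices into Z2
Z2 = np.array([1.0, 2.0, 3.0, 4.0])

def myfunction(y, z2):
    # Hypothetical stand-in: uses only operations that work on
    # scalars and on whole arrays alike
    return y * z2

# Whole-array version: no Python-level loop
result = myfunction(Y[X], Z2[Z[X]])

# Element-by-element version, for comparison only
looped = np.array([myfunction(Y[i], Z2[Z[i]]) for i in np.flatnonzero(X)])

assert np.array_equal(result, looped)
```

If myfunction cannot be expressed with array operations (for example, because it branches on each value in complicated ways), this rewrite does not apply directly and the compiled-decorator route may be the better fit.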

CodePudding user response:

How about using numpy.vectorize with a Boolean selector as an index?

np.vectorize takes a function and returns a vectorised version of that function that accepts arrays. You can then run the function on the whole array, or on a subset of it chosen with a selector.
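A small sketch of that idea, with a made-up scalar function (`clip_negative`) and made-up data. Note that, per the NumPy documentation, np.vectorize is provided mainly for convenience and is essentially a loop internally, so it should not be expected to give compiled-code speed.

```python
import numpy as np

def clip_negative(v):
    # Hypothetical scalar function, written for a single number
    return v if v > 0 else 0

vclip = np.vectorize(clip_negative)

data = np.array([-2, 5, -1, 3])
selector = np.array([True, True, False, True])

# Apply the function only to the selected subset
clipped = vclip(data[selector])
print(clipped)  # [0 5 3]
```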

In NumPy you can select a subset of an array by indexing it with a list (or array) of integer positions.

Called with a single argument, the numpy.where function returns the indexes at which that argument is True (or non-zero), just like numpy.nonzero.
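To make that concrete, here is a small sketch with made-up data. With a single argument, np.where returns a tuple holding one index array per dimension, and that tuple can be used to index other arrays of the same shape.

```python
import numpy as np

X = np.array([False, True, True, False, True])
Y = np.array([10, 11, 12, 13, 14])

idx = np.where(X)      # tuple with one index array per dimension
print(idx[0])          # [1 2 4]

# Indexing with the integer positions and with the Boolean mask
# select the same elements
print(Y[idx])          # [11 12 14]
print(Y[X])            # [11 12 14]
```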

Putting that together:

import numpy as np
import random

random.seed(0) # For repeatability

def myfunction(y, z, z2):
    return y*z2[z]

Y = np.array(range(100))  # Array of any length
Z = np.array(range(len(Y)))  # Must be the same length as Y; its values are indexes into Z2
Z2 = np.array(range(len(Y))) # Must be long enough for every index stored in Z
X = np.array([random.choice([True, False]) for toss in range(len(Y))]) # Throw some coins

# X is now a Boolean array of True/False
# print( "\n".join(f"{n}: {v}" for n, v in enumerate(X)))
print(np.where(X)) # Where gets the indexes of True items

# Vectorise our function
vfunc = np.vectorize(myfunction, excluded={2}) # We exclude {2} as it's the Z2 array

# Do the work
idx = np.where(X)  # Compute the indexes of the True items once
res = vfunc(Y[idx], Z[idx], Z2)

# Check our work
for outpos, inpos in enumerate(np.where(X)[0]):
    assert myfunction(Y[inpos], Z[inpos], Z2) == res[outpos], f"Mismatch at in={inpos}, out={outpos}"

# Print the results
print(res)