I have two arrays of x and y values (the same length):
x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8])
y = np.array([3, 4, 2, 6, 2, 3, 2, 10, 6, 4, 2, 3, 1, 8, 3, 1, 6, 4])
I have a separate dataframe:
df = pd.DataFrame({'Time': [0.3, 1.1], 'Duration': [0.2, 0.4]})
I want to zero out the values of y wherever the corresponding value of x falls inside any of the intervals given by df, i.e. where df['Time'][i] <= x < df['Time'][i] + df['Duration'][i] for any i, yielding the following:
y_out = np.array([3, 4, 0, 0, 2, 3, 2, 10, 6, 4, 0, 0, 0, 0, 3, 1, 6, 4])
Note: I have to do this on millions of points, so it has to be relatively fast...
CodePudding user response:
You can use np.greater_equal's outer function to make this vectorized:
mask = (np.greater_equal.outer(x, df['Time'].to_numpy())
        & np.less.outer(x, (df['Time'] + df['Duration']).to_numpy())).any(1)
Then simply
y[mask] = 0
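For reference, a self-contained sketch putting the pieces together with the data from the question (copying y into y_out is an assumption, to keep the original array intact):

import numpy as np
import pandas as pd

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9,
              1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8])
y = np.array([3, 4, 2, 6, 2, 3, 2, 10, 6, 4, 2, 3, 1, 8, 3, 1, 6, 4])
df = pd.DataFrame({'Time': [0.3, 1.1], 'Duration': [0.2, 0.4]})

starts = df['Time'].to_numpy()
ends = (df['Time'] + df['Duration']).to_numpy()

# (len(x), len(df)) boolean matrix: True where x[i] lies in interval j
mask = (np.greater_equal.outer(x, starts) & np.less.outer(x, ends)).any(axis=1)

y_out = y.copy()
y_out[mask] = 0
# expected: [3 4 0 0 2 3 2 10 6 4 0 0 0 0 3 1 6 4]  (y_out from the question)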
Using the outer product means that you compare, in a vectorized way, all values of your array x with the values in every row of df. This is fast, but costly in terms of memory.
Consider partitioning the processing in chunks, in case the whole operation doesn't fit in memory.
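A minimal sketch of what such chunking could look like (zero_in_intervals and chunk_size are illustrative names, not something from the answer above):

import numpy as np

def zero_in_intervals(x, y, starts, ends, chunk_size=1_000_000):
    # Zero y where x falls in any half-open [start, end) interval,
    # building the (chunk, n_intervals) mask one chunk of x at a time.
    out = y.copy()
    for i in range(0, len(x), chunk_size):
        xc = x[i:i + chunk_size]
        mask = (np.greater_equal.outer(xc, starts)
                & np.less.outer(xc, ends)).any(axis=1)
        out[i:i + chunk_size][mask] = 0   # the slice is a view, so this writes into out
    return out

Usage, with the arrays from the question:

y_out = zero_in_intervals(x, y, df['Time'].to_numpy(),
                          (df['Time'] + df['Duration']).to_numpy())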
CodePudding user response:
I would use np.multiply with a boolean mask built from logical operations, applied for each row (record) of df, like this:
np.multiply(y, ((x < record['Time']) | (x >= record['Time'] + record['Duration'])))
Here is a working example: https://abstra.show/4qgrdKVzLP
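For completeness, a hedged sketch of how this per-row masking could be applied to the data from the question; iterating over df with iterrows to obtain each record is an assumption about how the snippet is meant to be used:

import numpy as np

y_out = y.copy()
for _, record in df.iterrows():
    # Keep values outside the half-open interval [Time, Time + Duration)
    keep = (x < record['Time']) | (x >= record['Time'] + record['Duration'])
    y_out = np.multiply(y_out, keep)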