How to filter two numpy arrays?

Time:11-07

I don't understand much about programming but I have a giant mass of data to analyze and it has to be done in Python. Say I have two arrays:

import numpy as np
x=np.array([1,2,3,4,5,6,7,8,9,10])
y=np.array([25,18,16,19,30,5,9,20])

and say I want to choose the values in y which are greater than 17, and keep only the values in x which have the same index as the remaining values in y. For example, I want to erase the first value of y (25) and, accordingly, the matching value in x (1). I tried this:

filter=np.where(y>17, 0, y)

but I don't know how to filter the x values accordingly (the actual data are much longer arrays, so doing it "by hand" is basically impossible)

CodePudding user response:

As your question is not fully clear and you did not provide the expected output, here are two possibilities:

filtering

NumPy arrays can be indexed by an array (or other iterable) of booleans: only the positions where the boolean is True are kept.

If the two arrays were the same length you could do:

x[y>17]

Here, x is longer than y, so we first need to trim it to the same length:

import numpy as np
x=np.array([1,2,3,4,5,6,7,8,9,10])
y=np.array([25,18,16,19,30,5,9,20])

x[:len(y)][y>17]

Output: array([1, 2, 4, 5, 8])
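
To keep the matching values from both arrays, the same boolean mask can be stored once and applied to each. A minimal sketch; the names x_trimmed, mask, x_kept and y_kept are illustrative and not from the question:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([25, 18, 16, 19, 30, 5, 9, 20])

# Trim x to the length of y so both arrays line up index by index
x_trimmed = x[:len(y)]

# Boolean array: True where the y value passes the condition
mask = y > 17

# Apply the same mask to both arrays
x_kept = x_trimmed[mask]
y_kept = y[mask]

print(x_kept)  # [1 2 4 5 8]
print(y_kept)  # [25 18 19 30 20]
```

If the goal is the opposite (dropping the values greater than 17), the mask can be inverted with ~mask.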

replacement

To select between x and y based on a condition, use where:

np.where(y>17, x[:len(y)], y)

Output:

array([ 1,  2, 16,  4,  5,  5,  9,  8])
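
For comparison, here is a short sketch of what the attempt in the question does, alongside the one-argument form of where, which returns indices instead of values. The variable names replaced and indices are illustrative:

```python
import numpy as np

y = np.array([25, 18, 16, 19, 30, 5, 9, 20])

# Three-argument form: take 0 where the condition holds, else the y value.
# The result has the same length as y, so nothing is removed.
replaced = np.where(y > 17, 0, y)
print(replaced)  # [ 0  0 16  0  0  5  9  0]

# One-argument form: a tuple of index arrays; [0] picks the first (only) axis
indices = np.where(y > 17)[0]
print(indices)  # [0 1 3 4 7]
```

This is why where on its own did not shorten the array: it substitutes values rather than filtering them out.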

CodePudding user response:

As someone with little experience in Numpy specifically, I wrote this answer before seeing @mozway's excellent answer for filtering. My answer works on more generic containers than Numpy's arrays, though it uses more concepts as a result. I'll attempt to explain each concept in enough detail for the answer to make sense.

TL;DR:

Please, definitely read the rest of the answer, it'll help you understand what's going on.

import numpy as np
x = np.array([1,2,3,4,5,6,7,8,9,10])
y = np.array([25,18,16,19,30,5,9,20])

filtered_x_list = []
filtered_y_list = []

for i in range(min(len(x), len(y))):
    if y[i] > 17:
        filtered_y_list.append(y[i])
        filtered_x_list.append(x[i])

filtered_x = np.array(filtered_x_list)
filtered_y = np.array(filtered_y_list)

# These lines are just for us to see what happened
print(filtered_x) # prints [1 2 4 5 8]
print(filtered_y) # prints [25 18 19 30 20]

Pre-requisite Knowledge

Python containers (lists, arrays, and a bunch of other stuff I won't get into)

Let's take a look at the line:

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

What's Python doing?

The first thing it's doing is creating a list:

[1, 2, 3] # and so on

Lists in Python have a few features that are useful for us in this solution:

Accessing elements:

x_list = [ 1, 2, 3 ]
print(x_list[0]) # prints 1
print(x_list[1]) # prints 2, and so on

Adding elements to the end:

x_list = [ 1, 2, 3 ]
x_list.append(4)
print(x_list) # prints [1, 2, 3, 4]

Iteration:

x_list = [ 1, 2, 3 ]
for x in x_list:
    print(x)

# prints:
# 1
# 2
# 3

Numpy arrays are slightly different: we can still access and iterate elements in them, and we can change an existing value, but once they're created, we can't change their size - they have no .append method, and a value can't be deleted in place. Functions such as np.append and np.delete exist, but they build a new array rather than modifying the original.
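
A small sketch of that size difference, assuming nothing beyond standard lists and NumPy (the names x_arr and bigger are illustrative): a list grows in place, while np.append, which is a NumPy function and not an array method, returns a new array and leaves the original alone.

```python
import numpy as np

x_list = [1, 2, 3]
x_list.append(4)              # the list itself grows
print(x_list)                 # [1, 2, 3, 4]

x_arr = np.array([1, 2, 3])
bigger = np.append(x_arr, 4)  # builds and returns a new array
print(x_arr)                  # [1 2 3]
print(bigger)                 # [1 2 3 4]

# Arrays have no .append method of their own
print(hasattr(x_arr, "append"))  # False
```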

So the filtered_x_list and the filtered_y_list are empty lists we're creating, but we're going to modify them by adding the values we care about to the end.

The second thing Python is doing is creating a numpy array, using the list to define its contents. The array constructor can take a list expressed as [...], or a list defined by x_list = [...], which we're going to take advantage of later.

A little more on iteration

In your question, for every x element, there is a corresponding y element. We want to test something for each y element, then act on the corresponding x element, too.

Since we can access the same element in both arrays using an index - x[0], for instance - instead of iterating over one list or the other, we can iterate over all indices needed to access the lists.

First, we need to figure out how many indices we're going to need, which is just the length of the lists. len(x) lets us do that - in this case, it returns 10.

What if x and y are different lengths? In this case, I chose the smaller of the two - first, do len(x) and len(y), then pass those to the min() function, which is what min(len(x), len(y)) in the code above means.

Finally, we want to actually iterate through the indices, starting at 0 and ending at len(x) - 1 or len(y) - 1, whichever is smaller. The range sequence lets us do exactly that:

for i in range(10):
    print(i)
    
# prints:
# 0
# 1
# 2
# 3
# 4
# 5
# 6
# 7
# 8
# 9

So range(min(len(x), len(y))) gets us the indices to iterate over, and finally, this line makes sense:

for i in range(min(len(x), len(y))):

Inside this for loop, i now gives us an index we can use for both x and y.
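
To see which indices that produces for the arrays in the question, here is a quick check (the name n is illustrative):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([25, 18, 16, 19, 30, 5, 9, 20])

# The shorter array decides how many indices are safe to use
n = min(len(x), len(y))
print(n)  # 8

# range(n) runs from 0 up to n - 1
print(list(range(n)))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The last two values of x (9 and 10) are never visited, because y has no element at those positions.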

Now, we can do the comparison in our for loop:


for i in range(min(len(x), len(y))):
    if y[i] > 17:
        filtered_y_list.append(y[i])

Then, including the xs for the corresponding ys is a simple case of appending the x value at the same index to the x list:

for i in range(min(len(x), len(y))):
    if y[i] > 17:
        filtered_y_list.append(y[i])
        filtered_x_list.append(x[i])

The filtered lists now contain the numbers you're after. The last two lines, outside the for loop, just create numpy arrays from the results:


filtered_x = np.array(filtered_x_list)
filtered_y = np.array(filtered_y_list)

Which you might want to do, if certain numpy functions expect arrays.

While there are, in my opinion, better ways to do this (I would probably write custom iterators that produce the intended results without creating new lists), they require a somewhat more advanced understanding of programming, so I opted for something simpler.
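
One simpler alternative, shown only as a sketch and not necessarily what is meant by custom iterators above, is the built-in zip, which pairs up elements and stops when the shorter input runs out. The name pairs is illustrative:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([25, 18, 16, 19, 30, 5, 9, 20])

# zip pairs x[i] with y[i] and stops at the end of the shorter array,
# so no explicit min(len(x), len(y)) is needed
pairs = [(xi, yi) for xi, yi in zip(x, y) if yi > 17]

# Split the surviving pairs back into two arrays
filtered_x = np.array([xi for xi, _ in pairs])
filtered_y = np.array([yi for _, yi in pairs])

print(filtered_x)  # [1 2 4 5 8]
print(filtered_y)  # [25 18 19 30 20]
```

For large arrays, boolean indexing inside numpy will generally be faster than any Python-level loop, including this one.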
