Efficiently filter out matching rows between two 2-D numpy arrays

Time:10-07

I have two numpy arrays that look like

a = np.array([[1,2], [3,4], [5,6], [7,8]])
b = np.array([[1,2], [3,3], [5,6], [8,7]])

I want to filter to rows that are in array a, and not in array b. So my output should look like:

in_a_and_not_b = np.array([[3,4], [7,8]])

I have some nasty code to do this right now:

in_a_and_not_b = []
current_row = 0
for row_a in a:
    include_current_row = True
    for row_b in b:
        if np.array_equal(row_a, row_b):
            include_current_row = False
    if include_current_row:
        in_a_and_not_b.append(a[current_row])
    current_row += 1

My problem is that this takes forever. Is there a more numpy-thonic way to do this that will take less time?

In reality, my arrays a and b are large, around (50000, 2) in shape each.

Answer:

If you convert each array to a list of tuples (tuples are hashable, numpy rows are not), you can take the set difference: set(tuples_a) - set(tuples_b)


Minimal working example

import numpy as np

a = np.array([[1,2], [3,4], [5,6], [7,8]])
b = np.array([[1,2], [3,3], [5,6], [8,7]])

tuples_a = [tuple(x) for x in a.tolist()]
tuples_b = [tuple(x) for x in b.tolist()]

print( set(tuples_a) - set(tuples_b) )

Result:

{(3, 4), (7, 8)}

You can then convert the result back to a numpy.array.
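
A minimal sketch of that conversion, assuming sorted row order is acceptable. A set has no defined order, so the rows are sorted here only to make the result deterministic; the reshape is an extra precaution (not part of the answer above) so that an empty result still has two columns.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
b = np.array([[1, 2], [3, 3], [5, 6], [8, 7]])

# Set difference of rows, with each row represented as a hashable tuple
diff = set(map(tuple, a.tolist())) - set(map(tuple, b.tolist()))

# Sets are unordered, so sort the rows for a deterministic result.
# reshape(-1, a.shape[1]) keeps the result 2-D even when diff is empty.
in_a_and_not_b = np.array(sorted(diff), dtype=a.dtype).reshape(-1, a.shape[1])

print(in_a_and_not_b)
```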

Note that a set is unordered, so this approach does not preserve the original row order of a, and it also drops duplicate rows.
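
If the order matters, one possible variation (a sketch, not part of the answer above) is to build the set only from b and then test each row of a against it. The names rows_b and keep are illustrative. Each lookup is O(1) on average, so this makes a single pass over a instead of comparing every pair of rows.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
b = np.array([[1, 2], [3, 3], [5, 6], [8, 7]])

# Build the lookup set once from the rows of b
rows_b = set(map(tuple, b.tolist()))

# Boolean mask over a: True where the row does not occur in b
keep = np.array([tuple(row) not in rows_b for row in a.tolist()], dtype=bool)

# Boolean indexing keeps the surviving rows in their original order
in_a_and_not_b = a[keep]

print(in_a_and_not_b)
```

Because the mask is applied to a itself, duplicate rows of a that are absent from b are kept as well.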
