I have two numpy arrays that look like this:
a = np.array([[1,2], [3,4], [5,6], [7,8]])
b = np.array([[1,2], [3,3], [5,6], [8,7]])
I want to keep only the rows that are in array a and not in array b. So my output should look like:
in_a_and_not_b = np.array([[3,4], [7,8]])
I have some nasty code to do this right now:
in_a_and_not_b = []
current_row = 0
for row_a in a:
    include_current_row = True
    # Compare this row of a against every row of b
    for row_b in b:
        if np.array_equal(row_a, row_b):
            include_current_row = False
    if include_current_row:
        in_a_and_not_b.append(a[current_row])
    # Advance the index so it tracks row_a
    current_row += 1
My problem is that this takes forever. Is there a more numpy-thonic way to do this that will take less time?
In reality, my arrays a and b are large, around (50000, 2) in shape each.
CodePudding user response:
If you convert the arrays to lists of tuples,
then you can compute set(tuples_a) - set(tuples_b).
Minimal working example:
import numpy as np
a = np.array([[1,2], [3,4], [5,6], [7,8]])
b = np.array([[1,2], [3,3], [5,6], [8,7]])
tuples_a = [tuple(x) for x in a.tolist()]
tuples_b = [tuple(x) for x in b.tolist()]
print( set(tuples_a) - set(tuples_b) )
Result:
{(3, 4), (7, 8)}
Next, you can convert the result back to a numpy.array.
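A minimal sketch of that conversion, assuming a sorted row order is acceptable. The names difference and result are only illustrative, and the reshape is there so that an empty difference still yields an array with two columns:

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
b = np.array([[1, 2], [3, 3], [5, 6], [8, 7]])

# Set difference on rows represented as tuples
difference = {tuple(x) for x in a.tolist()} - {tuple(x) for x in b.tolist()}

# A set has no defined order, so sort the tuples to get a deterministic result.
# reshape(-1, a.shape[1]) keeps the 2-D shape even when the difference is empty.
result = np.array(sorted(difference), dtype=a.dtype).reshape(-1, a.shape[1])

print(result)
# [[3 4]
#  [7 8]]
```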
But if you need to keep the original row order, this can be a problem, because a set is unordered. Duplicate rows in a are also collapsed into one.
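If order matters, one possible variant (a sketch, not the only way to do it) is to build a set only from b and use it to compute a boolean mask over a. The names rows_b and mask are illustrative. Rows of a keep their original order, and duplicates in a are kept as well:

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
b = np.array([[1, 2], [3, 3], [5, 6], [8, 7]])

# Build the set once, so each membership test is O(1) on average
rows_b = {tuple(row) for row in b.tolist()}

# True where the row of a does not appear anywhere in b
mask = np.array([tuple(row) not in rows_b for row in a.tolist()], dtype=bool)

# Boolean indexing keeps the rows of a in their original order
in_a_and_not_b = a[mask]

print(in_a_and_not_b)
# [[3 4]
#  [7 8]]
```

This makes a single pass over each array instead of comparing every row of a with every row of b, which should matter at sizes around (50000, 2).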