How can I filter a numpy array a by the elements of a numpy array b, so that I get all the points in a that are not in b?
import numpy as np
a = np.array([[1,2],[1,3],[1,4]])
b = np.array([[1,2],[1,3]])
c = np.array([ d for d in a if d not in b])
print(c)
# actual outcome
# []
# desired outcome
# np.array([[1,4]])
CodePudding user response:
This probably will not be the most efficient approach in general (though it turns out to be faster than the other approaches presented here for this input; see below), but one thing you can do is convert the rows of a and b to tuples and then take their set difference:
# Method 1
tmp_1 = [tuple(i) for i in a] # -> [(1, 2), (1, 3), (1, 4)]
tmp_2 = [tuple(i) for i in b] # -> [(1, 2), (1, 3)]
c = np.array(list(set(tmp_1).difference(tmp_2)))
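One caveat with Method 1 is that a Python set has no defined order and drops duplicates, so the rows of c may not come back in the order they had in a. If order matters, a minimal sketch of a variant (the names b_rows and keep are illustrative, not from the original answer) builds a set only from b and uses it as a lookup:

```python
import numpy as np

a = np.array([[1, 2], [1, 3], [1, 4]])
b = np.array([[1, 2], [1, 3]])

# Hash the rows of b once; tuples are hashable, numpy rows are not.
b_rows = {tuple(row) for row in b}

# Boolean mask over the rows of a, in their original order.
keep = np.array([tuple(row) not in b_rows for row in a], dtype=bool)

c = a[keep]
print(c)  # [[1 4]]
```

This still loops over a in Python, so it is a readability and ordering tweak rather than a speed-up.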
As noted by @EmiOB, this post offers some insights into why [d for d in a if d not in b] in your question does not work. Drawing from that post, you can use
# Method 2
c = np.array([d for d in a if all(any(d != i) for i in b)])
Remarks
The implementation of array_contains(PyArrayObject *self, PyObject *el) (in C) says that calling array_contains(self, el) is equivalent to (self == el).any() in Python, where self is a pointer to an array and el is a pointer to a Python object.
In other words:
- if arr is a numpy array and obj is some arbitrary Python object, then obj in arr is the same as (arr == obj).any()
- if arr is a typical Python container such as a list, tuple, dictionary, and so on, then obj in arr is the same as any(obj is _ or obj == _ for _ in arr) (see membership test operations).
All of which is to say, the meaning of obj in arr depends on the type of arr.
This explains why the list comprehension that you proposed, [d for d in a if d not in b], does not have the desired effect.
This can be confusing because it is tempting to reason that, since a numpy array is a sequence (though not a standard Python one), membership test semantics should be the same. This is not the case.
Example:
a = np.array([[1,2],[1,3],[1,4]])
print((a == [1,2]).any()) # same as [1, 2] in a
# outputs True
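To make the pitfall more visible, here is a small sketch with a row that is not in a at all. The names probe, elementwise_hit and row_hit are made up for illustration. Because the comparison is elementwise, a single matching value is enough for the membership test to succeed, which is why every row of a in the question counts as "in b":

```python
import numpy as np

a = np.array([[1, 2], [1, 3], [1, 4]])
probe = [1, 5]  # not a row of a, but shares the value 1 with every row

# What `probe in a` effectively computes: any single element matching.
elementwise_hit = bool((a == probe).any())

# Row-wise membership: some row must match in every column.
row_hit = bool((a == probe).all(axis=1).any())

print(elementwise_hit, row_hit)  # True False
```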
Timings
For your input, I found my approach (Method 1) to be the fastest, followed by Method 2 obtained from the post @EmiOB suggested, followed by @DanielF's approach. I would not be surprised if changing the input size changes the ordering of the timings, so take them with a grain of salt.
# Method 1
5.96 µs ± 8.92 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
# Method 2
6.45 µs ± 27.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
# @DanielF's answer
16.5 µs ± 276 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
CodePudding user response:
When comparing row-wise like this, I tend to use @Jaime's recipe for converting to a void view:
vview = lambda a:np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
a[~np.isin(vview(a), vview(b)).squeeze()]
Out[]: array([[1, 4]])
This avoids the slow for loops of the other answers and doesn't create any intermediate data structures.
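For comparison, another loop-free option is plain broadcasting. This is a different technique from the void-view recipe, sketched here under the assumption that a and b have the same number of columns and are small enough that a temporary boolean array of shape (len(a), len(b), ncols) fits in memory:

```python
import numpy as np

a = np.array([[1, 2], [1, 3], [1, 4]])
b = np.array([[1, 2], [1, 3]])

# Compare every row of a with every row of b: shape (len(a), len(b), ncols),
# then require all columns to match: shape (len(a), len(b)).
pairwise_equal = (a[:, None, :] == b[None, :, :]).all(axis=2)

# A row of a is "in b" if it equals at least one row of b.
in_b = pairwise_equal.any(axis=1)

c = a[~in_b]
print(c)  # [[1 4]]
```

Unlike the void view, this does allocate an intermediate array, so it trades memory for simplicity.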
CodePudding user response:
Use this:
c = np.array([a_elem for a_elem in a if all(any(a_elem != b_elem) for b_elem in b)])
Output:
array([[1, 4]])
Explanation:
We loop over each row a_elem of a and compare it against every row b_elem of b. any(a_elem != b_elem) returns True if at least one value in a_elem differs from the corresponding value in b_elem, i.e. the two rows are not identical. all(any(a_elem != b_elem) for b_elem in b) returns True if a_elem differs from every row of b.
E.g.:
We take [1,2] from a and check, one by one, whether any of its elements differ from [1,2] and [1,3] from b. The result is False for [1,2] and True for [1,3]. This gives the list [False, True].
Next, we take [1,3] from a. The result is True for [1,2] and False for [1,3]. This gives the list [True, False].
Lastly, we take [1,4] from a. The result is True for both [1,2] and [1,3]. This gives the list [True, True].
Now, all() returns True only when every value in one of these lists is True. That is the case only for [1,4], so only [1,4] is added to our array.
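The walkthrough above can be reproduced with a short sketch that records the intermediate list for each row. The trace and flags names are illustrative only:

```python
import numpy as np

a = np.array([[1, 2], [1, 3], [1, 4]])
b = np.array([[1, 2], [1, 3]])

trace = []
for a_elem in a:
    # One flag per row of b: True if a_elem differs from that row somewhere.
    flags = [bool(np.any(a_elem != b_elem)) for b_elem in b]
    trace.append((a_elem.tolist(), flags, all(flags)))

for row, flags, kept in trace:
    print(row, flags, kept)
# [1, 2] [False, True] False
# [1, 3] [True, False] False
# [1, 4] [True, True] True
```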