How to filter a numpy array of points by another array

How can I filter a numpy array a by the rows of another numpy array b, so that I get all the points in a that are not in b?

import numpy as np

a = np.array([[1,2],[1,3],[1,4]])
b = np.array([[1,2],[1,3]])
c = np.array([ d for d in a if d not in b])
print(c)

# actual outcome
# []
# desired outcome
# np.array([[1,4]])

CodePudding user response：

This probably will not be the most efficient approach (though for this input it turns out to be faster than the other approaches presented here; see below), but one thing you can do is convert the rows of a and b to tuples and then take their set difference. Note that a set does not preserve the original row order and drops duplicate rows.

# Method 1
tmp_1 = [tuple(i) for i in a]    # -> [(1, 2), (1, 3), (1, 4)]
tmp_2 = [tuple(i) for i in b]    # -> [(1, 2), (1, 3)]

c = np.array(list(set(tmp_1).difference(tmp_2)))
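If the row order of a matters, a small variation on Method 1 keeps it. This is only a sketch, not part of the original method: the names b_rows and keep are chosen here for illustration, and it assumes the rows contain hashable scalars such as integers.

```python
import numpy as np

a = np.array([[1, 2], [1, 3], [1, 4]])
b = np.array([[1, 2], [1, 3]])

# Rows become hashable tuples so they can be stored in a set.
b_rows = {tuple(row) for row in b}

# Boolean mask: True where the row of `a` does not occur in `b`.
keep = np.array([tuple(row) not in b_rows for row in a], dtype=bool)

# Boolean indexing keeps the surviving rows in their original order.
c = a[keep]
print(c)  # [[1 4]]
```

The set lookup still runs in a Python-level loop over a, so this is meant for readability rather than speed.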

As noted by @EmiOB, this post offers some insights into why [ d for d in a if d not in b ] in your question does not work. Drawing from that post, you can use

# Method 2
c = np.array([d for d in a if all(any(d != i) for i in b)])

Remarks

The C implementation of array_contains(PyArrayObject *self, PyObject *el) states that calling array_contains(self, el) is equivalent to

(self == el).any()

in Python, where self is a pointer to an array and el is a pointer to a Python object.

In other words:

If arr is a numpy array and obj is some arbitrary Python object, then

obj in arr

is the same as

(arr == obj).any()

If arr is a typical Python container such as a list, tuple, or dictionary, then

obj in arr

is the same as

any(obj is _ or obj == _ for _ in arr)

(see membership test operations).

All of which is to say, the meaning of obj in arr is different depending on the type of arr.

This explains why the list comprehension that you proposed, [d for d in a if d not in b], does not have the desired effect.

This can be confusing because it is tempting to reason that, since a numpy array is a sequence (though not a standard Python one), its membership-test semantics should be the same as for a list. This is not the case.

Example:

a = np.array([[1,2],[1,3],[1,4]])
print((a == [1,2]).any())          # same as [1, 2] in a
# outputs True
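The same effect can be seen with a row that is not in the array at all. The sketch below uses the b from the question and a hypothetical variable row; it shows that the elementwise test succeeds because a single coordinate matches, while a whole-row test or a plain Python list gives the expected answer.

```python
import numpy as np

b = np.array([[1, 2], [1, 3]])
row = np.array([1, 4])        # not a row of b

elementwise = (b == row)      # broadcasts row against every row of b
print(elementwise)
# [[ True False]
#  [ True False]]
print(elementwise.any())      # True: the leading 1 matches somewhere
print(row in b)               # True, for the same reason

# A whole-row test needs all() along each row before any().
row_match = (b == row).all(axis=1).any()
print(row_match)              # False

# With nested Python lists, `in` compares whole rows.
print(row.tolist() in b.tolist())  # False
```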

Timings

For your input, I found Method 1 to be the fastest, followed by Method 2 (obtained from the post @EmiOB suggested), followed by @DanielF's approach. I would not be surprised if changing the input size changes the ordering of the timings, so take them with a grain of salt.

# Method 1
5.96 µs ± 8.92 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
# Method 2
6.45 µs ± 27.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
# @DanielF's answer
16.5 µs ± 276 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
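To rerun this kind of comparison on your own input, a minimal sketch with the standard timeit module could look like the following. The function names method_1, method_2 and void_view are labels chosen here for illustration, and the numbers will differ by machine, NumPy version and input size.

```python
import timeit
import numpy as np

a = np.array([[1, 2], [1, 3], [1, 4]])
b = np.array([[1, 2], [1, 3]])

def method_1(a, b):
    # Set difference on rows converted to tuples.
    return np.array(list(set(map(tuple, a)).difference(map(tuple, b))))

def method_2(a, b):
    # Nested any/all comparison of every pair of rows.
    return np.array([d for d in a if all(any(d != i) for i in b)])

def void_view(a, b):
    # Row-wise np.isin on a void view, following @DanielF's answer
    # (ravel() is used here so the mask is always one-dimensional).
    def vview(x):
        x = np.ascontiguousarray(x)
        return x.view(np.dtype((np.void, x.dtype.itemsize * x.shape[1])))
    return a[~np.isin(vview(a), vview(b)).ravel()]

n_calls = 1000
for func in (method_1, method_2, void_view):
    seconds = timeit.timeit(lambda: func(a, b), number=n_calls)
    print(f"{func.__name__}: {seconds / n_calls * 1e6:.2f} us per call")
```

Growing a and b (for example with np.random.randint) is the quickest way to check whether the ordering changes for larger inputs.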

CodePudding user response：

When comparing row-wise like this, I tend to use @Jaime's recipe for converting to a void view:

vview = lambda a:np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))

a[~np.isin(vview(a), vview(b)).squeeze()]
Out[]: array([[1, 4]])

This avoids the slow for loops of the other answers and doesn't create any intermediate data structures.
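To make the recipe less opaque, here is a sketch of what the void view does, step by step. It assumes a and b are two-dimensional, have the same dtype and the same number of columns; the view compares raw bytes, so rows only match when their byte representations are identical.

```python
import numpy as np

a = np.array([[1, 2], [1, 3], [1, 4]])
b = np.array([[1, 2], [1, 3]])

def vview(x):
    # Reinterpret each row as one opaque item of row_bytes bytes.
    x = np.ascontiguousarray(x)
    row_bytes = x.dtype.itemsize * x.shape[1]
    return x.view(np.dtype((np.void, row_bytes)))

va = vview(a)
print(a.shape, "->", va.shape)        # (3, 2) -> (3, 1)

# np.isin now tests whole rows, because each item is a whole row.
mask = np.isin(va, vview(b)).ravel()
print(mask)                           # [ True  True False]
print(a[~mask])                       # [[1 4]]
```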

CodePudding user response：

Use this:

c = np.array([a_elem for a_elem in a if all(any(a_elem != b_elem) for b_elem in b)])

Output:

array([[1, 4]])

Explanation:

We loop over each row a_elem of a and compare it with every row b_elem of b. any(a_elem != b_elem) returns True if at least one element of a_elem differs from the corresponding element of b_elem, that is, if the two rows are not identical. all(any(a_elem != b_elem) for b_elem in b) returns True only if a_elem differs from every row of b.

E.g.:

We take [1,2] from a and compare it with [1,2] and [1,3] from b, one at a time, checking whether any element differs. The result is False for [1,2] and True for [1,3]. This creates a list [False, True].

Next, we take [1,3] from a. It'll return True for [1,2] and False for [1,3]. This creates another list [True, False].

Lastly, we take [1,4] from a. It'll return True for both [1,2] and [1,3]. This creates a list [True, True]

Finally, all() returns True only for a list in which every value is True, which here is [True, True]. Hence only [1,4] is added to our array.
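The walk-through above can be reproduced by printing the intermediate lists. This is just a tracing sketch; the names trace and differs are introduced here for illustration.

```python
import numpy as np

a = np.array([[1, 2], [1, 3], [1, 4]])
b = np.array([[1, 2], [1, 3]])

trace = []
for a_elem in a:
    # One flag per row of b: True if a_elem differs from that row.
    differs = [any(a_elem != b_elem) for b_elem in b]
    # a_elem is kept only when it differs from every row of b.
    trace.append((a_elem.tolist(), differs, all(differs)))
    print(*trace[-1])
# [1, 2] [False, True] False
# [1, 3] [True, False] False
# [1, 4] [True, True] True
```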