Why is there no 'is' ufunc in numpy?

Time:11-05

I can certainly do

a[a == 0] = something

that sets every entry of a that equals zero to something. Equivalently, I could write

a[np.equal(a, 0)] = something

Now, imagine a is an array of dtype=object. I cannot write a[a is None] because, of course, a itself isn't None, so the expression evaluates to a single False rather than a mask. The intention is clear: I want the is comparison to be broadcast like any other ufunc. The list of ufuncs in the docs contains nothing like an is-ufunc.

Why is there none, and, more interestingly to me: what would be a performant replacement?

CodePudding user response:

There are two things at play here.

The first (and more important) one is that is is implemented directly in the Python interpreter, with no option to redirect it to a dunder method. NumPy arrays, like many other objects, have an __eq__ method that implements the == operation. By contrast, a is None is treated approximately as id(a) == id(None), with no recourse for an elementwise implementation under any circumstance. That's just how Python works.
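A minimal sketch of that difference. The class Loud is hypothetical, made up only to show that == reaches a dunder method while is never does:

```python
import numpy as np

class Loud:
    """Hypothetical class whose __eq__ claims equality with everything."""
    def __eq__(self, other):
        return True

x = Loud()
eq_result = (x == None)    # dispatches to Loud.__eq__
is_result = (x is None)    # pure identity check, nothing to override

a = np.array([1, None, 0], dtype=object)
elementwise = (a == None)  # ndarray.__eq__ broadcasts the comparison
identity = (a is None)     # one plain bool: the array itself is not None

print(eq_result, is_result, elementwise, identity)
```

The last line is the reason a[a is None] cannot work as a mask: the interpreter evaluates a is None before NumPy ever sees it.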

The second aspect is that NumPy is fundamentally designed for storing numbers. Object arrays are a special case that stores references to objects as numbers. This appears to be the same as how lists store object references, but the similarity only holds when dealing with references. The elements of a list are always references to objects, even when the list contains homogeneous integers, for example. A NumPy array of dtype int does not contain Python objects. Each consecutive element of the array is a raw binary integer, not a reference to a Python object wrapper. Even if Python allowed you to override the is operator, it would be meaningless to apply it elementwise to such an array, because there are no stored objects whose identity could be compared.
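As a small illustration of why identity has no meaning for numeric arrays (variable names here are illustrative only): indexing an int array builds a new scalar wrapper on every access, whereas an object array hands back the reference it stored.

```python
import numpy as np

ints = np.array([1, 2, 3])        # raw machine integers, no Python objects inside
first, again = ints[0], ints[0]   # each access builds a fresh NumPy scalar
same_scalar = first is again      # two distinct wrappers for the same value

marker = object()
objs = np.array([marker, None], dtype=object)  # stores references
same_ref = objs[0] is marker      # the stored reference comes back unchanged

print(same_scalar, same_ref)
```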

So if you want to compare objects, use python lists:

mylist = [...]
mylist = [something if x is None else x for x in mylist]

If you insist on using a numpy array, you have three options: (a) use numerical arrays and mark the missing elements with something other than None, like np.nan (which requires a float dtype); (b) treat the array as a list and apply id or is to each element, which are Python constructs, so there is no "performant" way to do it at that point; or (c) just use ==, which triggers a Python-level equality comparison on each element. For the singleton None that gives the same result as is, unless an element defines an __eq__ that claims equality with None.
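If an is-like ufunc is wanted purely for convenience, one possible sketch uses np.frompyfunc, which wraps a Python callable as a ufunc. It still calls Python once per element, so it falls under option (b) and should not be expected to be faster. The name is_none is an illustrative choice, not a NumPy function.

```python
import numpy as np

# Wrap the identity test as a 1-input, 1-output ufunc (hypothetical name).
is_none = np.frompyfunc(lambda x: x is None, 1, 1)

a = np.array([1, None, 0, np.nan, ''], dtype=object)
mask = is_none(a).astype(bool)   # frompyfunc returns an object array, so cast it
a[mask] = 'something'

print(mask)
print(a)
```

Because the result is a genuine ufunc, it broadcasts over arrays of any shape, which a flat list comprehension does not do by itself.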

CodePudding user response:

Except for operations like reshape and indexing that don't depend on dtype (except for the itemsize), operations on object dtype arrays are performed at list-comprehension speeds, iterating on the elements and applying an appropriate method to each. Sometimes that method doesn't exist, such as when doing np.sin.
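A short sketch of that method lookup, assuming a hypothetical Angle class that happens to define sin. The exact exception raised for plain floats differs between NumPy versions, so both likely types are caught:

```python
import numpy as np

class Angle:
    """Hypothetical element type that supplies its own sin method."""
    def __init__(self, radians):
        self.radians = radians

    def sin(self):
        return np.sin(self.radians)

with_method = np.empty(1, dtype=object)
with_method[0] = Angle(0.0)
result = np.sin(with_method)   # NumPy calls Angle.sin() on the element

plain = np.array([0.0, 1.0], dtype=object)
try:
    np.sin(plain)              # Python floats have no sin method
    failed = False
except (AttributeError, TypeError):
    failed = True

print(result, failed)
```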

To illustrate, consider the array from one of the comments:

In [132]: a = np.array([1, None, 0, np.nan, ''])
In [133]: a
Out[133]: array([1, None, 0, nan, ''], dtype=object)

The object array test:

In [134]: a==None
Out[134]: array([False,  True, False, False, False])
In [135]: timeit a==None
5.16 µs ± 73.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

An equivalent comprehension:

In [136]: [x is None for x in a]
Out[136]: [False, True, False, False, False]
In [137]: timeit [x is None for x in a]
1.52 µs ± 18.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

It's faster, even if we cast the result back to array (not a cheap step):

In [138]: timeit np.array([x is None for x in a])
4.67 µs ± 95.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Iteration on the list version of the array is even faster:

In [139]: timeit np.array([x is None for x in a.tolist()])
2.52 µs ± 48.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Let's look at the full assignment action:

In [141]: a[[x is None for x in a.tolist()]]
Out[141]: array([None], dtype=object)
In [142]: %%timeit a1=a.copy()
     ...: a1[[x is None for x in a1.tolist()]] = np.nan
4.03 µs ± 10 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [143]: %%timeit a1=a.copy()
     ...: a1[a1==None] = np.nan
6.18 µs ± 28.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The usual caveat applies: these timings are for a five-element array, and things might scale differently for larger ones.
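To check the scaling on your own machine, something along these lines could be used outside IPython. The helper names and the array sizes are arbitrary choices for illustration, and no timings are claimed here:

```python
import timeit
import numpy as np

def fill_with_eq(arr):
    """Replace None entries using the elementwise == comparison."""
    out = arr.copy()
    out[out == None] = np.nan
    return out

def fill_with_is(arr):
    """Replace None entries using an identity test in a list comprehension."""
    out = arr.copy()
    out[[x is None for x in out.tolist()]] = np.nan
    return out

base = [1, None, 0, np.nan, '']
for n in (5, 5_000, 50_000):
    arr = np.array(base * (n // 5), dtype=object)
    t_eq = timeit.timeit(lambda: fill_with_eq(arr), number=5)
    t_is = timeit.timeit(lambda: fill_with_is(arr), number=5)
    print(f"n={n}: == {t_eq:.6f} s, is {t_is:.6f} s (5 runs each)")
```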

CodePudding user response:

numpy has no None datatype. And if a itself were None, your first problem would be that you are trying to get an element from None:

a[a is None]

However, you can store nan (Not a Number) in a numeric array and check for it using isnan. See: https://numpy.org/doc/stable/reference/generated/numpy.isnan.html
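A minimal sketch of the isnan approach, assuming the data can live in a float array. Note that np.isnan has no loop for dtype=object, so it raises a TypeError when applied to an object array directly:

```python
import numpy as np

values = np.array([1.0, np.nan, 0.0])   # float dtype, NaN marks the missing entry
missing = np.isnan(values)              # vectorized test, no Python-level loop
values[missing] = -1.0

obj = np.array([1, None, 0], dtype=object)
try:
    np.isnan(obj)
    isnan_on_objects = True
except TypeError:
    isnan_on_objects = False             # isnan is not defined for object arrays

print(missing, values, isnan_on_objects)
```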
