Why does numba work on numpy string vectors but not on numpy strings?

Consider the simplest possible function:

import numba
import numpy as np

@numba.jit
def foo(s1):
    return s1

Now construct an array of np.bytes_ objects:

> a = np.array(['abc']*5, dtype='S5')
> a
array([b'abc', b'abc', b'abc', b'abc', b'abc'], dtype='|S5')

Why does calling foo with the vector work:

> foo(a)
array([b'abc', b'abc', b'abc', b'abc', b'abc'], dtype='|S5')

But calling foo with a single element raises an exception:

> foo(a[0])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_8124/2559272744.py in <module>
----> 1 foo(a[0])

TypeError: bad argument type for built-in operation

(This is running numba 0.54.1 from conda-forge on Windows with Python 3.9.7 and numpy 1.20.3)

CodePudding user response：

Neither bytes nor np.bytes_ types are listed in the set of types supported by numba as of the latest release. The closest things it supports would be:

- Character sequences (read: str), though the documentation specifically says "no operations are available on them", so this is of little use. Your function would work if you called foo(a[0].decode()) to turn the value into text, but only because the function does nothing with its argument.
- Actual numpy arrays. Viewing the bytes/np.bytes_ as an np.array is cheap, so you could call foo(np.frombuffer(a[0], np.uint8)) and get something that is more useful programmatically and represents the same data.
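
As a minimal sketch of the two conversions mentioned above, using only numpy (no numba call is made here): the variable names are illustrative, and decoding as ASCII is an assumption about the data.

```python
import numpy as np

a = np.array(['abc'] * 5, dtype='S5')
s = a[0]  # a numpy.bytes_ scalar

# Option 1: decode to text (ASCII is an assumption about the data).
as_text = s.decode('ascii')

# Option 2: view the raw bytes as an array of unsigned 8-bit integers.
# No copy is made, so the resulting array is read-only.
as_codes = np.frombuffer(s, dtype=np.uint8)

print(as_text)   # abc
print(as_codes)  # [97 98 99]
```

Either result could then be passed to the jitted function in place of a[0]. Note that the scalar holds 3 bytes, not 5: numpy strips the trailing null padding when an element is extracted from an 'S5' array.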

CodePudding user response：

The bytes type, like the str type, is barely supported: the support is minimal and very inefficient. Moreover, there are some open related bugs (like this one). Furthermore, AFAIK, there is no plan to work on this any time soon.

From my understanding, a[0] returns a numpy.bytes_-typed object, which is not completely compatible with bytes (at least for Numba). Compiling the function with a numpy.bytes_ argument appears to trigger a bug in which Numba confuses numpy.bytes_ with bytes (Numba tries to use a compiled function with the wrong type).
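
The type difference can be checked without numba. This is a small sketch with illustrative variable names. It only shows what Python and numpy report about the types, not how Numba handles them internally.

```python
import numpy as np

a = np.array(['abc'] * 5, dtype='S5')
elem = a[0]

# Indexing returns the NumPy scalar type, not the built-in bytes type.
print(type(elem))

# numpy.bytes_ is a subclass of bytes, so isinstance checks still pass.
print(isinstance(elem, bytes))

# bytes(...) produces an object whose exact type is the built-in bytes.
plain = bytes(elem)
print(type(plain))
```

So the two objects compare equal and share a base class, yet their exact types differ, which is consistent with the idea that a type-dispatching compiler could treat them differently.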

Indeed, the following code works:

@numba.jit
def foo(s1):
    return s1

foo(b'test')      # Works
foo(bytes(a[0]))  # Works

The following code fails:

@numba.jit
def foo(s1):
    return s1

foo(a[0])         # Fails and triggers the bug
foo(bytes(a[0]))  # Now fails (the function is not recompiled properly)
foo(b'test')      # Also fails (the function is not recompiled properly)

Note that the bytes type is only supported in read-only mode.
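
Given the behaviour described above, one possible workaround is to convert arguments to built-in bytes before they ever reach the jitted function. The helper below is hypothetical (as_plain_bytes is not part of numpy or numba) and is only a sketch of that idea.

```python
import numpy as np

def as_plain_bytes(value):
    """Return built-in bytes for numpy.bytes_ input; pass anything else through."""
    # Hypothetical helper: convert the NumPy scalar to the exact bytes type.
    if isinstance(value, np.bytes_):
        return bytes(value)
    return value

a = np.array(['abc'] * 5, dtype='S5')
arg = as_plain_bytes(a[0])
print(type(arg))

# Intended use, on a freshly compiled function that has not yet seen numpy.bytes_:
#     foo(as_plain_bytes(a[0]))
```

Arrays and other argument types pass through unchanged, so the same wrapper can sit in front of both the vector call and the single-element call.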