How exactly to work with string arrays in numba?

Time:09-29

We've been trying to speed up our API call. We used pandas at first, then moved to numpy (which we believe is faster than pandas), and now we're applying numba to speed it up even further. Applying numba to my numeric arrays went really well, but I'm still struggling with the string (nominal) array. I can't find the answers I need on numba's website or here on Stack Overflow.

Below is a simplified version of the function we want, showing the kinds of operations I perform on my string array. Many posts here state that numba now works with strings, so I believe there should be a solution for my code, since I'm only using simple data manipulations.

# Loading packages
import numpy as np
from numba import jit

# Versions
# python 3.9.12
# numba  0.55.1
# numpy  1.21.5

# Creating a toy array
input_array = np.array([np.nan,'C','P'], dtype="<U11")
print(input_array) # ['nan' 'C' 'P']

# Starting with the python version of the code to show what the aim is:
def foo_python(input_array):
    
    # Creating output array
    output_array = np.empty(shape=3, dtype="float32")
    
    # 1st procedure - Replace missing values with "Miss"
    input_array[input_array == 'nan'] = "Miss"
    
    # 2nd procedure - map strings to numbers
    output_array[0] = {"False": -0.01960485, "True": 1.1470174, "Miss": -1.0}.get(
        str(input_array[0]), input_array[0]
    )    
    
    # 3rd procedure - checking if value belongs to a list of strings
    input_array[1] = np.where(input_array[1] in ['A','B','C'], input_array[1], 'Other')
    
    # 4th procedure - creating a dummy version of the cell
    output_array[2] = np.where(input_array[2] == 'K', 1, 0)

    return output_array

      
foo_python(input_array) # array([-1.0000000e+00, -2.6711958e+07,  0.0000000e+00], dtype=float32); element 1 is never assigned, so its value is whatever np.empty left in memory

# Numba version:
@jit(nopython=True)
def foo_numba(input_array):

    # Creating output array
    output_array = np.empty(shape=3, dtype="float32")

    # 1st procedure - Replace missing values with "Miss"
    input_array[input_array == "nan"] = "Miss"

    # 2nd procedure- map strings to numbers
    output_array[0] = {"False": -0.01960485, "True": 1.1470174, "Miss": -1.0}.get(
        str(input_array[0]), input_array[0]
    )

    # 3rd procedure - checking if value belongs to a list of strings
    input_array[1] = np.where(
        input_array[1] in ["A", "B", "C"], input_array[1], "Other"
    )

    # 4th procedure - creating a dummy version of the cell
    output_array[2] = np.where(input_array[2] == "K", 1, 0)

    return output_array


foo_numba(input_array)

When I apply @jit(nopython=True) on top of it, these are the errors I get:

For the 1st procedure:

TypingError: No implementation of function Function() found for signature:

setitem(array([unichr x 11], 1d, C), Literalbool, Literalstr)

For the 2nd procedure:

TypingError: - Resolution failure for literal arguments: No implementation of function Function(<function impl_get at 0x000002926D59C310>) found for signature:

impl_get(DictType[unicode_type,float64]<iv=None>, unicode_type, [unichr x 11])

For the 3rd procedure:

TypingError: No implementation of function Function(<function where at 0x0000029264686C10>) found for signature:

where(bool, [unichr x 11], Literalstr)

For the 4th procedure:

It works! So I believe the error in the 3rd procedure may not come from np.where itself, but rather from the type difference between input_array[1] and 'Other'?

I've tried replacing the first procedure with a for loop, but is that really the best or only solution?

Answer:

The given code can be made to run with numba.

  1. The 1st procedure has a typing issue in the conditional (boolean-mask) indexing, but it can be worked around by using a loop instead.
  2. The 2nd procedure relies on a Python dict, which numba does not fully support because of its lack of strict typing. In the error above, the dict itself is typed as DictType[unicode_type,float64], but the default passed to .get() is a string ([unichr x 11]), which does not match the float64 values. Numba has its own typed Dict that can be used instead, though it is more cumbersome.
  3. The 3rd procedure has a similar typing issue in np.where(), but it can be worked around by replacing the call with a plain if statement.

# Loading packages
import numpy as np
from numba import jit
from numba.core import types
from numba.typed import Dict

# Versions
# python 3.9.12
# numba  0.55.1
# numpy  1.21.5

# Creating a toy array
input_array = np.array([np.nan,'C','P'], dtype='<U11')
print(input_array) # ['nan' 'C' 'P']

# Starting with the python version of the code to show what the aim is:
def foo_python(input_array):
    
    # Creating output array
    output_array = np.empty(shape=3, dtype="float32")
    
    # 1st procedure - Replace missing values with "Miss"
    input_array[input_array == 'nan'] = "Miss"
    
    # 2nd procedure - map strings to numbers
    output_array[0] = {"False": -0.01960485, "True": 1.1470174, "Miss": -1.0}.get(
        str(input_array[0]), input_array[0]
    )
    
    # 3rd procedure - checking if value belongs to a list of strings
    input_array[1] = np.where(input_array[1] in ['A','B','C'], input_array[1], 'Other')
    
    # 4th procedure - creating a dummy version of the cell
    output_array[2] = np.where(input_array[2] == 'K', 1, 0)

    return output_array

foo_python(input_array) # array([-1.0000000e+00, -2.6711958e+07,  0.0000000e+00], dtype=float32); element 1 is never assigned, so its value is whatever np.empty left in memory

# Numba version:
@jit(nopython=True)
def foo_numba(input_array):

    # Creating output array
    output_array = np.empty(shape=3, dtype="float32")

    # 1st procedure - Replace missing values with "Miss" (explicit loop instead of mask indexing)
    for i, s in enumerate(input_array):
        if s == "nan":
            input_array[i] = "Miss"

    # 2nd procedure- map strings to numbers
    d = Dict.empty(
        key_type=types.unicode_type,
        value_type=types.float64,
    )
    d["False"] = -0.01960485
    d["True"] = 1.1470174
    d["Miss"] = -1.0

    output_array[0] = 0  # default when no key matches; must be numeric for the float32 array
    for k in d.keys():
        if input_array[0] == k:
            output_array[0] = d[k]

    # 3rd procedure - checking if value belongs to a list of strings
    if input_array[1] not in ['A','B','C']:
        input_array[1] = "Other"

    # 4th procedure - creating a dummy version of the cell
    output_array[2] = np.where(input_array[2] == "K", 1, 0)

    return output_array

foo_numba(input_array)

Typed Dict reference:

https://numba.pydata.org/numba-doc/dev/reference/pysupported.html#typed-dict
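
A different option, not taken from the answer above, is to keep strings out of the compiled code altogether. If the set of categories is known up front, the strings can be converted to integer codes with plain NumPy first, so that only numeric arrays reach the jitted function. The sketch below is a minimal version under that assumption; encode_categories, keys and table are hypothetical names, and 'nan' is used as the key for missing values in place of "Miss".

```python
import numpy as np


def encode_categories(values, categories):
    """Return the index of each string in `categories`, or -1 if it is not listed."""
    categories = np.asarray(categories)
    order = np.argsort(categories)            # searchsorted needs sorted input
    sorted_cats = categories[order]
    pos = np.searchsorted(sorted_cats, values)
    pos = np.clip(pos, 0, len(sorted_cats) - 1)
    found = sorted_cats[pos] == values        # exact match, not just insertion point
    return np.where(found, order[pos], -1)


# Toy array from the question; np.nan is stored as the string 'nan'
input_array = np.array([np.nan, "C", "P"], dtype="<U11")

# 2nd procedure expressed as a lookup table: position in `keys` -> value in `table`
keys = np.array(["False", "True", "nan"])
table = np.array([-0.01960485, 1.1470174, -1.0], dtype=np.float32)

codes = encode_categories(input_array, keys)
# Unknown strings (code -1) get a numeric default of 0.0
mapped = np.where(codes >= 0, table[codes], np.float32(0.0))
```

Both codes and table are ordinary numeric arrays, so they can be passed to a jitted function like any other numeric input.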
