Why does numpy's `np.char.encode` turn an empty unicode array into an empty `float64` array?

Time:07-20

I have an empty unicode array:

import numpy as np

a = np.array([], dtype=np.str_)

I want to encode it:

b = np.char.encode(a, encoding='utf8')

Why is the result an empty array with dtype=float64?

# array([], dtype=float64)

If the array is not empty, the result is a properly encoded array with dtype=|S[n]:

a = np.array(['ss', 'ff☆'], dtype=np.str_)
b = np.char.encode(a, encoding='utf8')
# array([b'ss', b'ff\xe2\x98\x86'], dtype='|S5')

EDIT: The accepted answer below does, in fact, answer the question as posed, but if you came here looking for a workaround, here is what I did:

# Guard the empty case explicitly; otherwise encode as usual.
if a.size == 0:
    encoded_array = np.chararray((0,))
else:
    encoded_array = np.char.encode(a, encoding='utf8')

This will produce an empty encoded array with dtype='|S1' if your decoded array is empty.
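
The same guard can be wrapped in a small reusable helper. This is only a minimal sketch: the function name encode_utf8 is hypothetical, and it uses np.empty with dtype 'S1' for the empty case instead of np.chararray, on the assumption that a plain bytes-typed ndarray is what you want back.

```python
import numpy as np


def encode_utf8(arr):
    """Encode a unicode array to UTF-8 bytes, keeping a bytes dtype when empty.

    Hypothetical helper, not part of NumPy.
    """
    arr = np.asarray(arr, dtype=np.str_)
    if arr.size == 0:
        # Nothing to encode: return an empty bytes array of the same shape.
        return np.empty(arr.shape, dtype='S1')
    return np.char.encode(arr, encoding='utf8')


empty_result = encode_utf8([])
full_result = encode_utf8(['ss', 'ff\u2606'])
print(empty_result.dtype, full_result.dtype)
```

Both branches return a bytes-typed array, so downstream code does not need to special-case an empty input.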

CodePudding user response:

Looking at the source of numpy.char.encode, it basically calls _vec_string, which in this case returns an empty array of type np.object_. That result is passed to _to_string_or_unicode_array, which builds the final array and determines its dtype. It does so by converting the NumPy array to a list and passing that list to np.asarray, so that NumPy infers the dtype from the elements. The catch is that an empty list has no elements to infer from, and empty arrays default to np.float64 by convention (I think this is because NumPy was initially designed for physicists, who usually work with np.float64 arrays).

This result is quite unexpected in this context, but the "S0" type does not exist, and I am not sure everyone would agree that "S1" is better here (still, it is certainly better than np.float64). Feel free to file an issue on the NumPy GitHub repository to start a discussion about this behaviour.
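
The dtype inference described above can be reproduced without going through np.char.encode at all. The snippet below is a rough sketch of the mechanism, not the library's actual code: the variable intermediate is a stand-in (an assumption) for the empty object array produced during encoding.

```python
import numpy as np

# Stand-in for the empty np.object_ array built internally (assumption).
intermediate = np.array([], dtype=np.object_)

# Converting to a list discards the dtype; only the (absent) elements remain.
as_list = intermediate.tolist()

# With no elements to inspect, np.asarray falls back to the default dtype.
inferred = np.asarray(as_list)

# With elements present, the dtype is inferred from the longest bytes item.
nonempty = np.asarray([b'ss', b'ff\xe2\x98\x86'])

print(inferred.dtype, nonempty.dtype)
```

The empty case ends up as float64 only because that is NumPy's default for an empty list, while the non-empty case gets a bytes dtype sized to its longest element.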
