Why are zero-terminated bytes not recommended in numpy.array for strings?


As per the NumPy official doc:

Array-protocol type strings (see The Array Interface)
...

  • 'S', 'a' zero-terminated bytes (not recommended)
  • 'U' Unicode string

...
For backward compatibility with Python 2 the S and a typestrings remain zero-terminated bytes and numpy.string_ continues to alias numpy.bytes_. To use actual strings in Python 3 use U or numpy.str_. For signed bytes that do not need zero-termination b or i1 can be used. [emphasis added]

Two questions:

  1. If I understand correctly, dt = np.dtype("S25") # 25-length zero-terminated bytes (example from doc) should only hold 24 valid chars (bytes), since the last one is a null terminator. Consider the code:
>>> import numpy as np
>>> arr = np.array(["aa", "bb"], dtype="S2") # bytes
>>> arr.data.tobytes()
b'aabb'
>>> arr.data.nbytes
4
>>> arr = np.array(["aa", "bb"], dtype="<U2") # unicode
>>> arr.data.tobytes()
b'a\x00\x00\x00a\x00\x00\x00b\x00\x00\x00b\x00\x00\x00'
>>> arr.data.nbytes
16

So where are the null terminators? I don't see what difference "zero-terminated" makes here compared to the Unicode case.

  2. Why is this type not recommended? Pure bytes can save more space compared to UTF-32 (or UTF-16).

CodePudding user response:

If I understand correctly, dt = np.dtype("S25") # 25-length zero-terminated bytes (example from doc) should only hold 24 valid chars (bytes), since the last one is a null terminator.

Actually, not always. S25 means that exactly 25 bytes are reserved for each string item. If an item is shorter, the remaining bytes are padded with zeros; if it exactly fills the width, no terminator is stored at all. Here is an example:

import numpy as np
arr = np.array(["a", "bb"], dtype="S2")
arr.data.tobytes()  # b'a\x00bb'

"a" is encoded as b'a\x00' because it is shorter than the limit 2.

Why is this type not recommended? Pure bytes can save more space compared to UTF-32(or UTF-16).

Well, to put it shortly: it is because of compatibility. Long before Unicode, each application used its own charset. This was a mess because most applications could not agree on a standard charset or communicate correctly, which resulted in a lot of bugs. Wide strings were quickly adopted to work around the very limited range of byte strings, which simply cannot be used sanely in countries like Japan, China or the Arabic countries. However, this did not fix the main issue: having a way for applications to agree on a unique charset. The universal character set (aka Unicode) was born to fix this mess, and it was a huge success. This is especially true for the Web. But there is a catch: how to efficiently encode strings?

The Unicode standard defines several ways to encode Unicode strings, including the UTF-8, UTF-16 and UTF-32 encodings. UTF-8 is great for English-speaking countries since most characters are part of the ASCII charset, so a UTF-8 string takes roughly the same space as an ASCII string, and computing/decoding UTF-8 strings is very fast in this case (though a bit slower than ASCII because of an additional check). The catch is that when the string contains complex characters like Chinese, Japanese or Arabic ones, most characters are encoded in 2 to 4 bytes. Computing/decoding such strings is quite inefficient since the processor cannot predict the branches and SIMD instructions can hardly be used. In fact, in this case UTF-16 takes less space than UTF-8. This is also why people in Asian/Arabic/African countries often prefer UTF-16 while English/European countries prefer UTF-8. UTF-32 is generally not great in terms of space, but it is faster than the others for relatively short strings containing complex characters, mainly thanks to SIMD, and it is also much simpler to implement. So it is all about choosing a trade-off between space, speed and simplicity, and the outcome is fundamentally locale-specific.
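To make the space trade-off concrete, here is a small Python sketch (the byte counts only apply to the assumed sample strings below):

text_ascii = "hello"
text_cjk = "你好世界"  # assumed 4-character CJK sample

for text in (text_ascii, text_cjk):
    print(len(text.encode("utf-8")),     # 5 and 12 (3 bytes per CJK character)
          len(text.encode("utf-16-le")), # 10 and 8 (2 bytes per character here)
          len(text.encode("utf-32-le"))) # 20 and 16 (always 4 bytes per character)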

AFAIK, the use of UTF-32 in NumPy is not documented anywhere and may change in the future. Note that NumPy tends to favour speed over memory footprint: it makes use of SIMD instruction sets to speed up most functions.
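The per-item footprint is easy to check directly. In the sketch below, the factor of 4 reflects NumPy's current UTF-32 storage for 'U' dtypes and could change if the internal representation ever does:

import numpy as np

np.dtype("S25").itemsize  # 25  -- one byte per character slot
np.dtype("U25").itemsize  # 100 -- four bytes per character slot (UTF-32)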
