Split a long byte array into a NumPy array of strings

Usually, when creating a NumPy array of strings, we can do something like this:

>>> import numpy as np
>>> np.array(["Hello world!", "good bye world!", "whatever world"])
array(['Hello world!', 'good bye world!', 'whatever world'], dtype='<U15')

My question is this: I am given a long byte array from a foreign C function, like this:

b'Hello world!\x00<some rubbish bytes>good bye world!\x00<some rubbish bytes>whatever world\x00<some rubbish bytes>'

It is guaranteed that each 32-byte block holds one null-terminated string (i.e., a \x00 byte follows the valid part of the string). I need to convert this long byte array into something like array(['Hello world!', 'good bye world!', 'whatever world'], dtype='<U15'), preferably in place (i.e., without copying memory).

This is what I do now:

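str_arr = [None] * str_count  # one slot per 32-byte record
for i in range(str_count):
    # Take record i, keep the bytes before the first null, and decode them.
    str_arr[i] = byte_arr[i * 32: (i + 1) * 32].split(b'\x00')[0].decode('utf-8')
str_arr_np = np.array(str_arr)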

It works, but it is awkward and not done in place (the bytes are copied at least once, if not twice). Is there a better approach?

CodePudding user response:

If you can zero out the data on the C side, so that each fixed-width field is padded with null bytes rather than garbage, then you can read it with np.frombuffer, and that will probably be about as efficient as you can reasonably expect:

>>> raw = b'hello world\x00\x00\x00\x00\x00Good Bye\x00\x00\x00\x00\x00\x00\x00\x00'
>>> np.frombuffer(raw, dtype='S16')
array([b'hello world', b'Good Bye'], dtype='|S16')

Of course, this gives you an array of byte strings rather than unicode strings, although that may be desirable in your case.
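
If unicode strings are needed after all, one option is np.char.decode. The sketch below assumes the fields are valid UTF-8 and zero-padded as in the example above; the variable names are only illustrative. Note that decoding allocates a new array, so this step is not zero-copy.

```python
import numpy as np

raw = b'hello world\x00\x00\x00\x00\x00Good Bye\x00\x00\x00\x00\x00\x00\x00\x00'

# Zero-copy view of the buffer as fixed-width byte strings.
byte_strings = np.frombuffer(raw, dtype='S16')

# Decoding builds a separate unicode array (this is where a copy happens).
text_strings = np.char.decode(byte_strings, 'utf-8')
print(text_strings.tolist())  # ['hello world', 'Good Bye']
```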

Note that the above relies on NumPy's built-in behavior of stripping trailing null bytes. If there is garbage after the terminator, it won't work:

>>> data = b'hello world\x00aaaaGood Bye\x00\x00\x00\x00\x00\x00\x00\x00'
>>> np.frombuffer(data, dtype='S16')
array([b'hello world\x00aaaa', b'Good Bye'], dtype='|S16')
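
If the C side cannot be changed, a possible workaround is to zero out the garbage from Python before viewing the buffer as byte strings. The following is only a sketch: the function name fixed_records_to_bytes_array and its width parameter are made up for illustration, and it assumes the buffer length is a multiple of the record width. With a writable buffer (such as a bytearray) the cleanup happens in place; with an immutable bytes object it falls back to a single copy.

```python
import numpy as np

def fixed_records_to_bytes_array(buf, width=32):
    """View fixed-width, null-terminated records as an 'S<width>' array.

    Hypothetical helper: bytes after the first null in each record are
    treated as garbage and overwritten with zeros.
    """
    raw = np.frombuffer(buf, dtype=np.uint8).reshape(-1, width)
    if not raw.flags.writeable:
        # Immutable source (e.g. bytes): work on a copy instead.
        raw = raw.copy()
    # True from the first null byte onward within each record.
    after_null = np.logical_or.accumulate(raw == 0, axis=1)
    raw[after_null] = 0
    # Reinterpret the cleaned bytes as fixed-width byte strings (no copy).
    return raw.reshape(-1).view(f"S{width}")

data = bytearray(b'hello world\x00aaaaGood Bye\x00\x00\x00\x00\x00\x00\x00\x00')
print(fixed_records_to_bytes_array(data, width=16))
```

The boolean mask costs one temporary array of the same size as the buffer, so this is not free, but the string data itself is not duplicated when the buffer is writable.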

Note that this doesn't make a copy: the array is a view of the original buffer. Because a bytes object is immutable, the view is read-only:

>>> arr = np.frombuffer(raw, dtype='S16')
>>> arr
array([b'hello world', b'Good Bye'], dtype='|S16')
>>> arr[0] = b"z"*16
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: assignment destination is read-only

However, if the underlying buffer is writable (say you had a bytearray to begin with), then writes to the array modify the buffer directly:

>>> raw = bytearray(raw)
>>> arr = np.frombuffer(raw, dtype='S16')
>>> arr[0] = b"z"*16
>>> arr
array([b'zzzzzzzzzzzzzzzz', b'Good Bye'], dtype='|S16')
>>> raw
bytearray(b'zzzzzzzzzzzzzzzzGood Bye\x00\x00\x00\x00\x00\x00\x00\x00')
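
A more direct way to check that no copy was made is to inspect the array's flags, or to ask NumPy whether two views overlap. This is a small sketch using the same kind of zero-padded data as above; byte_view is just an illustrative name for a second view of the same bytearray.

```python
import numpy as np

raw = bytearray(b'hello world' + b'\x00' * 5 + b'Good Bye' + b'\x00' * 8)
arr = np.frombuffer(raw, dtype='S16')

# owndata is False when the array borrows memory from another object.
print(arr.flags.owndata)

# A second view of the same bytearray overlaps with the first one.
byte_view = np.frombuffer(raw, dtype=np.uint8)
print(np.shares_memory(arr, byte_view))
```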