Usually, when creating a NumPy array of strings, we can do something like this:
>>> import numpy as np
>>> np.array(["Hello world!", "good bye world!", "whatever world"])
array(['Hello world!', 'good bye world!', 'whatever world'], dtype='<U15')
Now the question: I am given a long byte array from a foreign C function, like this:
b'Hello world!\x00<some rubbish bytes>good bye world!\x00<some rubbish bytes>whatever world\x00<some rubbish bytes>'
It is guaranteed that every 32-byte block holds one null-terminated string (i.e., a \x00 byte follows the valid part of the string). I need to convert this long byte array to something like array(['Hello world!', 'good bye world!', 'whatever world'], dtype='<U15'), preferably in place (i.e., without copying memory).
This is what I do now:
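str_arr = [None] * str_count  # one slot per 32-byte record
for i in range(str_count):
    # Take the i-th 32-byte record, cut it at the first null byte, and decode it.
    str_arr[i] = byte_arr[i * 32: (i + 1) * 32].split(b'\x00')[0].decode('utf-8')
str_arr_np = np.array(str_arr)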
It works, but it is kind of awkward and not done in-place (bytes are copied at least once, if not twice). Are there any better approaches?
Answer:
If you can zero out the data on the C side, then you can read it with np.frombuffer, and that will probably be about as efficient as you can reasonably expect:
>>> raw = b'hello world\x00\x00\x00\x00\x00Good Bye\x00\x00\x00\x00\x00\x00\x00\x00'
>>> np.frombuffer(raw, dtype='S16')
array([b'hello world', b'Good Bye'], dtype='|S16')
Of course, this gives you byte strings rather than unicode strings, although that may be desirable in your case.
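If unicode strings are needed after all, one option is np.char.decode, sketched below. This is only an illustration: the sample data and variable names are made up, and the decode step necessarily allocates a new array, so the result is no longer a zero-copy view of the buffer.

```python
import numpy as np

# Hypothetical sample data: two 16-byte, zero-padded records.
raw = b"hello world" + b"\x00" * 5 + b"Good Bye" + b"\x00" * 8

# Zero-copy view of the buffer as fixed-width byte strings.
byte_strings = np.frombuffer(raw, dtype="S16")

# Decoding builds a new unicode array (this step copies the data).
unicode_strings = np.char.decode(byte_strings, "utf-8")
print(unicode_strings)
```

Keeping the 'S' dtype and decoding only the elements that are actually displayed avoids that copy.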
Note that the above relies on NumPy's built-in behavior of stripping trailing null bytes. If there is garbage after the terminator, it won't work:
>>> data = b'hello world\x00aaaaGood Bye\x00\x00\x00\x00\x00\x00\x00\x00'
>>> np.frombuffer(data, dtype='S16')
array([b'hello world\x00aaaa', b'Good Bye'], dtype='|S16')
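If the C side cannot be changed, the garbage can also be zeroed from Python before the buffer is viewed as byte strings. The following is a rough sketch, not a definitive implementation: the helper name zero_after_terminator is hypothetical, it assumes a writable buffer (such as a bytearray) whose length is a multiple of the record width, and it allocates temporary boolean masks even though the string data itself is modified in place.

```python
import numpy as np

def zero_after_terminator(buf, width=32):
    """Zero every byte after the first NUL in each fixed-width record (in place)."""
    # View the buffer as an (n_records, width) matrix of bytes; no copy is made.
    rows = np.frombuffer(buf, dtype=np.uint8).reshape(-1, width)
    is_nul = rows == 0
    # Column of the first NUL per record; records without a NUL are left untouched.
    first_nul = np.where(is_nul.any(axis=1), is_nul.argmax(axis=1), width)
    # True for every position that comes after the terminator.
    garbage = np.arange(width)[None, :] > first_nul[:, None]
    rows[garbage] = 0  # writes through to buf
    return np.frombuffer(buf, dtype=f"S{width}")

# Hypothetical sample data: the first record has garbage after its terminator.
data = bytearray(b"hello world\x00aaaaGood Bye" + b"\x00" * 8)
cleaned = zero_after_terminator(data, width=16)
print(cleaned)
```

With an immutable bytes object the assignment fails because the view is read-only, so the buffer would have to be copied into a bytearray first.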
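Note that this shouldn't make a copy. The array is a view of the original buffer, which is why it is read-only when that buffer is an immutable bytes object: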
>>> arr = np.frombuffer(raw, dtype='S16')
>>> arr
array([b'hello world', b'Good Bye'], dtype='|S16')
>>> arr[0] = b"z"*16
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: assignment destination is read-only
However, if the underlying buffer is writable, say you had a bytearray to begin with, then writes to the array go straight through to it:
>>> raw = bytearray(raw)
>>> arr = np.frombuffer(raw, dtype='S16')
>>> arr[0] = b"z"*16
>>> arr
array([b'zzzzzzzzzzzzzzzz', b'Good Bye'], dtype='|S16')
>>> raw
bytearray(b'zzzzzzzzzzzzzzzzGood Bye\x00\x00\x00\x00\x00\x00\x00\x00')
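The no-copy behavior can also be checked directly instead of being inferred from the write-through. A small sketch, using made-up sample data and the array flags together with np.shares_memory:

```python
import numpy as np

# Hypothetical sample data: two 16-byte, zero-padded records.
raw = bytearray(b"hello world" + b"\x00" * 5 + b"Good Bye" + b"\x00" * 8)
arr = np.frombuffer(raw, dtype="S16")

# A second view of the same buffer, used only for the memory comparison.
byte_view = np.frombuffer(raw, dtype=np.uint8)

shares = np.shares_memory(arr, byte_view)  # both views sit on the same memory
owns = arr.flags.owndata                   # the array does not own its data
writable = arr.flags.writeable             # the bytearray source is mutable
read_only = not np.frombuffer(bytes(raw), dtype="S16").flags.writeable

print(shares, owns, writable, read_only)
```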