My problem is as follows:
I'm reading a .csv generated by some software, and to read it I'm using Pandas. Pandas reads the .csv properly, but one of the columns stores byte sequences representing vectors, and Pandas stores them as strings.
So I have data (a string) and I want to use np.frombuffer() to get the proper vector. The problem is that data is a string, so it's already encoded, and when I use .encode() to turn it into bytes, the sequence is not the original one.
Example: the .csv contains \x00\x00, representing the vector [0,0] with dtype=np.uint8. Pandas stores it as a string, and when I try to process it, something like this happens:
data = df.data[x] # With x any row.
print(type(data))
<class 'str'>
print(data)
\x00\x00
e_data = data.encode("latin1")
print(e_data)
b'\\x00\\x00'
v = np.frombuffer(e_data, np.uint8)
print(v)
[ 92 120  48  48  92 120  48  48]
I just want to get b'\x00\x00' from data instead of b'\\x00\\x00', which I understand is a little encoding mess I have not been able to fix yet.
Any way to do this?
Thanks!
CodePudding user response:
Issue: you (apparently) have a string that contains literal backslash escape sequences, such as:
>>> x = r'\x00' # note the use of a raw string literal
>>> x # Python's representation of the string escapes the backslash
'\\x00'
>>> print(x) # but it looks right when printing
\x00
From this, you wish to create a corresponding bytes object, wherein the backslash-escape sequences are translated into the corresponding bytes.
Handling these kinds of escape sequences is done using the unicode-escape string encoding. As you may be aware, string encodings convert between bytes and str objects, specifying the rules for which byte sequences correspond to which Unicode code points. However, the unicode-escape codec assumes that the escape sequences are on the bytes side of the equation and that the str side will have the corresponding Unicode characters:
>>> rb'\x00'.decode('unicode-escape') # create a string with a NUL char
'\x00'
Applying .encode to the string will reverse that process; so if you start with the backslash-escape sequence, it will re-escape the backslash:
>>> r'\x00'.encode('unicode-escape') # the result contains two backslashes, represented as four
b'\\\\x00'
>>> list(r'\x00'.encode('unicode-escape')) # let's look at the numeric values of the bytes
[92, 92, 120, 48, 48]
As you can see, that is clearly not what we want.
We want to convert from bytes to str to do the backslash-escape processing. But we have a str to start, so we need to change that to bytes; and we want bytes at the end, so we need to change the str that we get from the escape processing. In both cases, we need to make each Unicode code point from 0-255 inclusive correspond to a single byte with the same value. The encoding we need for that task is called latin-1, also known as iso-8859-1.
For example:
>>> r'\x00'.encode('latin-1')
b'\\x00'
Thus, we can reason out the overall conversion:
>>> r'\x00'.encode('latin-1').decode('unicode-escape').encode('latin-1')
b'\x00'
As desired: our str with a literal backslash, lowercase x and two zeros, is converted to a bytes object containing a single zero byte.
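Putting it together for the original question, here is a minimal sketch. The column name and dtype come from the question; the DataFrame and its sample values are stand-ins for the CSV, not your actual data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'data': [r'\x00\x00', r'\x01\x02']})  # stand-in for the CSV column

def to_vector(s, dtype=np.uint8):
    # latin-1 maps code points 0-255 one-to-one to bytes; unicode-escape then
    # processes the backslash escapes; a final latin-1 encode yields raw bytes
    raw = s.encode('latin-1').decode('unicode-escape').encode('latin-1')
    return np.frombuffer(raw, dtype=dtype)

df['vector'] = df['data'].apply(to_vector)
print(df['vector'][0])  # [0 0]
print(df['vector'][1])  # [1 2]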
Alternately: we can request that backslash-escapes are processed while decoding, by using escape_decode from the codecs standard library module. However, this isn't documented and isn't really meant to be used that way - it's internal stuff used to implement the unicode-escape codec and possibly some other things.
If you want to expose yourself to the risk of that breaking in the future, it looks like:
>>> import codecs
>>> codecs.escape_decode(r'\x00\x00')
(b'\x00\x00', 8)
We get a 2-tuple, with the desired bytes and what I assume is the number of Unicode code points that were decoded (i.e. the length of the string). From my testing, it appears that it can only use UTF-8 encoding for the non-backslash sequences (but this could be specific to how Python is configured), and you can't change this; unlike a proper decode method, there is no parameter to specify the encoding. Like I said - not meant for general use.
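If you do accept that risk, a sketch along the same lines (reusing the example value from the question) could look like:
import codecs
import numpy as np

s = r'\x00\x00'                         # the string from the question
raw, _length = codecs.escape_decode(s)  # undocumented API; may change without notice
print(np.frombuffer(raw, np.uint8))     # [0 0]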
Yes, all of that is as awkward as it seems. The reason you don't get easy support for this kind of thing is that it isn't really how you're intended to design your system. Fundamentally, all data is bytes; text is an abstraction that is encoded by that byte data. Using a single byte (with value 0) to represent four characters of text (the symbols \, x, 0 and 0) is not a normal encoding, and not a reversible one (how do I know whether to decode the byte as those four characters, or as a single NUL character?). Instead, you should strongly consider using some other friendly string representation of your data (perhaps a plain hex dump) and a non-text-encoding-related way to parse it. For example:
>>> data = '41 42' # a string in a simple hex dump format
>>> bytes.fromhex(data) # support is built-in, and works simply
b'AB'
>>> list(bytes.fromhex(data))
[65, 66]
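As a hedged sketch of what that could look like end to end (the column name, hex separator and dtype here are illustrative assumptions, not part of the original setup):
import numpy as np
import pandas as pd

# producing side: store each vector as a space-separated hex string
vectors = [np.array([0, 0], dtype=np.uint8), np.array([65, 66], dtype=np.uint8)]
df = pd.DataFrame({'data': [v.tobytes().hex(' ') for v in vectors]})  # bytes.hex(sep) needs Python 3.8+

# consuming side: parse the hex string back into a vector, no escape handling needed
df['vector'] = df['data'].apply(lambda s: np.frombuffer(bytes.fromhex(s), np.uint8))
print(df['vector'][1])  # [65 66]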