Decoding a bytes sequence - what's the train of thought when doing it?

I have this byte sequence and I need to decode it. I am a complete beginner in Python and in encodings.

enc = b'\x80\x03}q\x00(K\x01K\x01K\x02K\x03K\x03K\x06K\x04G?\xc5UUUUUUK\x05G?\xe0\x00\x00\x00\x00\x00\x00K\x06G?\x9cq\xc7\x1cq\xc7\x1cK\x07G?\xc5UUUUUUK\x08K$K\tG?\xb5UUUUUUK\nK\x07K\x0bG?\xe5UUUUUUK\x0cG?\xb5UUUUUUK\rG?\xedUUUUUUK\x0eK4K\x0fG?\xb3\xb1;\x13\xb1;\x14K\x10K\x00K\x11G?\xcd\x89\xd8\x9d\x89\xd8\x9eK\x12G?\xcb\x9b\x9b\x9b\x9b\x9b\x9cK\x13G?\xa4\x14\x14\x14\x14\x14\x14K\x14X\x08\x00\x00\x00discretaq\x01K\x15K\x02K\x16X\x02\x00\x00\x00daq\x02K\x17G?\xe4z\xe1G\xae\x14{K\x18G@\x15\x00\x00\x00\x00\x00\x00K\x19G?\xe4z\xe1G\xae\x14|K\x1aK2K\x1bK\x01K\x1cK\x03K\x1dG?\xd5UUUUUUK\x1eG?\xc5UUUUUUK\x1fK\x01K K\x04K!G?\xaf\xf2\xe4\x8e\x8aq\xdeK"K\x04K#X\x04\x00\x00\x00mareq\x03u.'

I tried doing it this way:

strputere = enc.decode()

print(strputere)

and I get an error:

File "encode.py", line 4, in <module>
    strputere = enc.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

I started doing a bit of research, and I found that b stands for bytes.

So my enc variable is a bytes literal. I've looked into .decode() and it seemed like a good choice, but it might not be.

I'm a bit confused because it is a bytes literal, but it contains some escapes (such as \x80) that I think are UTF-8 characters.

So, how can I decode this, and what would be the algorithm for that? I would love to understand what happens. I did my research, but I'm a bit lost and would appreciate some help.

CodePudding user response：

So, generally when you have a byte sequence you have two different ways to approach it, depending on the contents:

Is it a pure string sequence?

If the bytes are nothing but encoded text, decode them with:

enc.decode("utf-8")

Keep in mind that you must know which encoding was used (here utf-8). Judging by the error message you got, utf-8 is not the right choice for this data.

If you don't know the encoding but you know it's definitely encoded text, you can take a look at the options mentioned in this question here
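As a rough sketch of that trial-and-error approach, the helper below (try_decodings is a made-up name, and the candidate list is just an assumption) tries a few encodings with strict error handling and keeps the ones that do not raise. Note that a successful decode only means the bytes are legal in that encoding, not that the encoding is the right one: latin-1 accepts every possible byte.

```python
def try_decodings(data, candidates=("ascii", "utf-8", "latin-1")):
    """Return {encoding: text} for each candidate that decodes without error."""
    results = {}
    for name in candidates:
        try:
            # bytes.decode() is strict by default and raises on invalid input
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            pass
    return results


# Real text encoded as UTF-8: ascii fails, utf-8 and latin-1 both "succeed",
# but only the utf-8 result is the original text.
sample = "discretă".encode("utf-8")
print(try_decodings(sample))

# The first bytes of the sequence from the question: only latin-1 survives,
# which is a hint that this is probably not text at all.
print(try_decodings(b"\x80\x03}q\x00"))
```

If the only survivor is latin-1 and the result is full of control characters, the data is most likely binary rather than text.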

Sensor/Other input

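If the bytes come from an embedded device, or from any source that packs several fields into one binary record rather than a single piece of text, you can use struct.unpack(). This is a bit more complicated, and you will need to go through the docs to find the exact format string that matches your data.

The way it works is that you tell Python what type each field is (integer, float, byte string, etc.) and how many bytes it occupies, and it converts the bytes into a tuple of objects. The format string must account for exactly len(enc) bytes, otherwise struct.unpack() raises struct.error. The format below describes only 19 bytes, so it illustrates the syntax and does not fit the enc from the question:

import struct
values = list(struct.unpack('>BBHBBhBHhHL', enc))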
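To make the struct idea concrete, here is a minimal round-trip sketch. The packet layout (sensor id, temperature, sequence number) is purely hypothetical and chosen for illustration; a real device would document its own layout.

```python
import struct

# Hypothetical packet layout (an assumption, not taken from the question):
#   B  sensor id        1 byte,  unsigned
#   h  temperature      2 bytes, signed, in hundredths of a degree
#   H  sequence number  2 bytes, unsigned
PACKET_FORMAT = ">BhH"  # ">" means big-endian with no padding

# Build a packet so the example is self-contained.
packet = struct.pack(PACKET_FORMAT, 7, -1250, 513)
print(len(packet), struct.calcsize(PACKET_FORMAT))

# Unpack it again: the result is a tuple with one item per format character.
sensor_id, temp_raw, seq = struct.unpack(PACKET_FORMAT, packet)
print(sensor_id, temp_raw / 100, seq)

# The buffer length must match the format exactly.
try:
    struct.unpack(PACKET_FORMAT, packet + b"\x00")
except struct.error as exc:
    print("unpack failed:", exc)
```

struct.calcsize() is a quick way to check whether a format string can possibly fit a given buffer before calling struct.unpack().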

CodePudding user response：

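This data was serialized with Python's pickle module: it starts with \x80\x03, the header of pickle protocol 3, and ends with a period, which is the pickle STOP opcode. You can decode it like this: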

>>> import pickle
>>> numbers = pickle.loads(enc)
>>> print(numbers)
{1: 1, 2: 3, 3: 6, 4: 0.16666666666666666, 5: 0.5, 6: 0.027777777777777776, 7: 0.16666666666666666, 8: 36, 9: 0.08333333333333333, 10: 7, 11: 0.6666666666666666, 12: 0.08333333333333333, 13: 0.9166666666666666, 14: 52, 15: 0.07692307692307693, ...
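If you want to check that guess before loading anything, the standard pickletools module can list the opcodes without executing them. The sketch below builds its own small pickle (the dictionary contents are made up, loosely modelled on the output above) so that it runs on its own. Keep in mind that pickle.loads() can run arbitrary code, so only call it on data you trust.

```python
import io
import pickle
import pickletools

# A small stand-in for the data in the question, written with protocol 3.
small = pickle.dumps({1: 1, 5: 0.5, 20: "discreta"}, protocol=3)

# Protocol 2 and later start with the PROTO opcode (0x80) followed by the
# protocol number; every pickle ends with the STOP opcode, a period.
print(small[:2])   # b'\x80\x03'
print(small[-1:])  # b'.'

# Disassemble without executing anything; the listing names each opcode,
# e.g. BININT1 for the "K" bytes and BINFLOAT for the "G" bytes.
listing = io.StringIO()
pickletools.dis(small, out=listing)
print(listing.getvalue())

# Only once the source is trusted: actually rebuild the object.
restored = pickle.loads(small)
print(restored)
```

Running pickletools.dis(enc) on the sequence from the question should give the same kind of listing, with one BININT1 per integer key and one BINFLOAT per float value.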

CodePudding user response：

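The error happens because the bytes contain values of 0x80 and above that do not form valid UTF-8 sequences. In UTF-8, 0x80 can only appear as a continuation byte inside a multi-byte character, so it can never be the first byte, which is exactly what "invalid start byte" means.

Is it just random data or is it encoded using some particular encoding? Decoding with "unicode_escape" does not raise, because that codec treats the input as Latin-1, where every byte maps to a character, but the output does not appear that useful.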
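To see why the very first byte is already a problem, here is a small sketch that classifies a byte value by the role it can play in UTF-8. The function name utf8_byte_role is made up for this example; the ranges follow the UTF-8 definition.

```python
def utf8_byte_role(value):
    """Classify one byte value (0-255) by its possible role in UTF-8."""
    if value <= 0x7F:
        return "ascii"         # complete one-byte character
    if value <= 0xBF:
        return "continuation"  # only valid after a lead byte
    if 0xC2 <= value <= 0xF4:
        return "lead"          # starts a 2-, 3- or 4-byte character
    return "invalid"           # 0xC0, 0xC1, 0xF5-0xFF never appear


print(utf8_byte_role(0x80))      # continuation
print(utf8_byte_role(ord("K")))  # ascii

# The decoder reports the same thing: position 0 holds a byte that
# cannot start a character.
try:
    b"\x80\x03".decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.start, exc.reason)  # 0 invalid start byte
```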

enc.decode("unicode_escape")

returns:

'\x80\x03}q\x00(K\x01K\x01K\x02K\x03K\x03K\x06K\x04G?ÅUUUUUUK\x05G?à\x00\x00\x00\x00\x00\x00K\x06G?\x9cqÇ\x1cqÇ\x1cK\x07G?ÅUUUUUUK\x08K$K\tG?µUUUUUUK\nK\x07K\x0bG?åUUUUUUK\x0cG?µUUUUUUK\rG?íUUUUUUK\x0eK4K\x0fG?³±;\x13±;\x14K\x10K\x00K\x11G?Í\x89Ø\x9d\x89Ø\x9eK\x12G?Ë\x9b\x9b\x9b\x9b\x9b\x9cK\x13G?¤\x14\x14\x14\x14\x14\x14K\x14X\x08\x00\x00\x00discretaq\x01K\x15K\x02K\x16X\x02\x00\x00\x00daq\x02K\x17G?äzáG®\x14{K\x18G@\x15\x00\x00\x00\x00\x00\x00K\x19G?äzáG®\x14|K\x1aK2K\x1bK\x01K\x1cK\x03K\x1dG?ÕUUUUUUK\x1eG?ÅUUUUUUK\x1fK\x01K K\x04K!G?¯òä\x8e\x8aqÞK"K\x04K#X\x04\x00\x00\x00mareq\x03u.'