What is the string representation of a bytes object, really, and how do I get the bytes object back?

I have some text encoded to bytes using UTF-8. While processing this text I carelessly used str() to turn it into a Unicode string, because I assumed this would automatically decode the bytes object with the right encoding. This, however, is not the case. For example:

a = "عجائب"
a_bytes = a.encode(encoding="utf-8")
b = str(a_bytes)

yields

b = "b'\\xd8\\xb9\\xd8\\xac\\xd8\\xa7\\xd8\\xa6\\xd8\\xa8'"

which is not what I expected. According to the docs

If neither encoding nor errors is given, str(object) returns type(object).__str__(object), [...].

So my question is: What is the implemented string representation of a bytes object in Python and can I recreate my original Unicode string from it in general?

CodePudding user response:

Called without an encoding, str() returns the repr() of the bytes object, that is, a string containing the text of a bytes literal:

>>> a = "عجائب"
>>> a_bytes = a.encode(encoding="utf-8")
>>> a_bytes
b'\xd8\xb9\xd8\xac\xd8\xa7\xd8\xa6\xd8\xa8'
>>> str(a_bytes)
"b'\\xd8\\xb9\\xd8\\xac\\xd8\\xa7\\xd8\\xa6\\xd8\\xa8'"

That string is a valid Python literal for a bytes object, so you can parse it back into an actual bytes object with ast.literal_eval:

>>> import ast
>>> ast.literal_eval(str(a_bytes))
b'\xd8\xb9\xd8\xac\xd8\xa7\xd8\xa6\xd8\xa8'

The result is equal to a_bytes. To get a str back, decode the bytes properly, either with .decode() or by passing the encoding to str():

>>> str(a_bytes, 'utf-8')
'عجائب'
>>> a_bytes.decode('utf-8')
'عجائب'
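
Putting the two steps together, here is a minimal sketch of a helper that undoes an accidental str(a_bytes). The name recover_text and its parameters are made up for illustration, and it assumes you know which encoding the text was originally encoded with (UTF-8 here):

```python
import ast


def recover_text(bytes_repr: str, encoding: str = "utf-8") -> str:
    """Turn the output of str(some_bytes) back into the original text.

    recover_text is a hypothetical helper name, not a standard function.
    """
    # literal_eval only parses Python literals; unlike eval() it does not
    # run arbitrary code. It raises ValueError or SyntaxError on bad input.
    value = ast.literal_eval(bytes_repr)
    if not isinstance(value, bytes):
        raise ValueError(f"expected a bytes literal, got {type(value).__name__}")
    # The encoding must match the one used for the original encode() call.
    return value.decode(encoding)


original = "عجائب"
mangled = str(original.encode("utf-8"))  # the accidental conversion
restored = recover_text(mangled)
print(restored == original)  # True
```

If the encoding is not known, the bytes can still be recovered this way, but decoding them to the right text is not guaranteed.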

CodePudding user response:

When you call str() on a bytes object without specifying an encoding, it does not decode anything; it returns the printable representation of the object as a string. To turn UTF-8 bytes back into the original string, call the decode() method with the encoding that was used to encode the text:

a = "عجائب"
a_bytes = a.encode(encoding="utf-8")
b = str(a_bytes)

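print(b)        # the repr of the bytes object, stored as a str
print(a_bytes)  # the bytes object itself; print() shows the same repr
print(a_bytes.decode("utf-8"))  # the original string, decoded from the bytes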

Output:

b'\xd8\xb9\xd8\xac\xd8\xa7\xd8\xa6\xd8\xa8'
b'\xd8\xb9\xd8\xac\xd8\xa7\xd8\xa6\xd8\xa8'
عجائب
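
The first two output lines look identical, which can hide the difference between the two values. As a small sketch (the variable names as_text and decoded are just for illustration), these checks show that str(a_bytes) is the ASCII-only repr of the bytes, while decode() gives back the five-character original:

```python
a = "عجائب"
a_bytes = a.encode("utf-8")

as_text = str(a_bytes)             # no encoding given: same text as repr(a_bytes)
decoded = a_bytes.decode("utf-8")  # actual decoding of the UTF-8 bytes

print(as_text == repr(a_bytes))    # True
print(len(as_text), len(decoded))  # 43 5
print(as_text.isascii())           # True: only the characters b, ', \, x and hex digits
print(decoded == a)                # True
```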