I have some text encoded to bytes using UTF-8 encoding. When processing this text, I incautiously used str() to turn it into a Unicode string, because I assumed this would automatically decode the bytes object with the right encoding. This, however, is not the case. For example:
a = "عجائب"
a_bytes = a.encode(encoding="utf-8")
b = str(a_bytes)
yields
b = "b'\\xd8\\xb9\\xd8\\xac\\xd8\\xa7\\xd8\\xa6\\xd8\\xa8'"
which is not what I expected. According to the docs:
If neither encoding nor errors is given, str(object) returns type(object).__str__(object), [...].
So my question is: What is the implemented string representation of a bytes object in Python and can I recreate my original Unicode string from it in general?
CodePudding user response:
It gives you a string containing the string representation of a bytes object:
>>> a = "عجائب"
>>> a_bytes = a.encode(encoding="utf-8")
>>> a_bytes
b'\xd8\xb9\xd8\xac\xd8\xa7\xd8\xa6\xd8\xa8'
>>> str(a_bytes)
"b'\\xd8\\xb9\\xd8\\xac\\xd8\\xa7\\xd8\\xa6\\xd8\\xa8'"
This means you have a valid literal representing a bytes object. You can parse that literal back into an actual bytes object using ast.literal_eval:
>>> import ast
>>> ast.literal_eval(str(a_bytes))
b'\xd8\xb9\xd8\xac\xd8\xa7\xd8\xa6\xd8\xa8'
This is again the same as a_bytes. You can properly decode those bytes to a str, either using .decode or by using the encoding parameter of str:
>>> str(a_bytes, 'utf-8')
'عجائب'
>>> a_bytes.decode('utf-8')
'عجائب'
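To answer the "in general" part of the question, the two steps above (parse the literal, then decode) can be combined. The following is only a minimal sketch: recover_text is a hypothetical helper name, and it assumes the mangled string really came from str() on a bytes object and that the original encoding is known (UTF-8 here).

```python
import ast


def recover_text(repr_string: str, encoding: str = "utf-8") -> str:
    """Hypothetical helper: undo an accidental str(bytes_obj) call."""
    # literal_eval safely parses the "b'...'" literal back into a bytes object
    # without executing arbitrary code.
    raw = ast.literal_eval(repr_string)
    if not isinstance(raw, bytes):
        raise TypeError("expected the string form of a bytes literal")
    # Decoding needs the encoding that was originally used (assumed UTF-8).
    return raw.decode(encoding)


original = "عجائب"
as_bytes = original.encode("utf-8")
mangled = str(as_bytes)  # the accidental conversion from the question

# Without an encoding argument, str() on bytes gives the same text as repr().
assert mangled == repr(as_bytes)
assert recover_text(mangled) == original
```

This only works while the string is still an intact bytes literal. If it has been truncated or otherwise edited, ast.literal_eval will raise an error instead of returning bytes.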
CodePudding user response:
When you call str() with a bytes object as its only argument, it does not decode anything; it returns the printable representation of the object (the b'...' literal) as a string. To decode UTF-8 bytes back to the original string, call the decode() method and specify the encoding that was used to encode the text:
a = "عجائب"
a_bytes = a.encode(encoding="utf-8")
b = str(a_bytes)
print(b)
print(a_bytes)
print(a_bytes.decode("utf-8"))  # prints the string decoded from the bytes
Output:
b'\xd8\xb9\xd8\xac\xd8\xa7\xd8\xa6\xd8\xa8'
b'\xd8\xb9\xd8\xac\xd8\xa7\xd8\xa6\xd8\xa8'
عجائب
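Why the encoding has to match can be seen with a small experiment. This is an illustrative sketch, not part of the original answer; the choice of latin-1 and ascii as the "wrong" codecs is an assumption made purely for demonstration.

```python
text = "عجائب"
data = text.encode("utf-8")

# The matching codec round-trips cleanly.
assert data.decode("utf-8") == text

# latin-1 maps every byte to exactly one character, so decoding "succeeds"
# but produces unreadable text (mojibake) instead of the original string.
wrong = data.decode("latin-1")
assert wrong != text
assert len(wrong) == len(data)  # one character per byte

# ASCII cannot represent bytes >= 0x80, so decoding raises instead.
try:
    data.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(ascii_failed)
```

A wrong codec can therefore fail silently as well as loudly, which is a good reason to always pass the encoding explicitly.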