None of the answers I've come across seem to make sense, or at the least, I can't find out how to search for this issue.
I have a unicode string like this one:
It\xe2\x80\x99s up to you. ¢ € £ ¥ [
As you can see, it contains \x-style escape sequences mixed in with characters that are already proper Unicode. How do I turn all of it into proper Unicode characters (converting the \x sequences into their native form)?
Telling me what the \x format is called might help!
CodePudding user response:
The \x prefix introduces a hexadecimal escape: the two hex digits that follow give the value of a single byte. This thread may help you with the conversion: How to change '\x??' to unicode in python?
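As a small illustration (the variable names here are only for the example), the three escapes in the question, \xe2\x80\x99, are three separate bytes that together form the UTF-8 encoding of one character, the right single quotation mark U+2019:

```python
# Each \xNN escape stands for one byte, written as two hex digits.
raw = bytes([0xE2, 0x80, 0x99])   # the same value as b'\xe2\x80\x99'

# Interpreted as UTF-8, those three bytes make up a single character.
quote = raw.decode('utf-8')

print(len(raw), len(quote))   # 3 bytes, 1 character
print(hex(ord(quote)))        # 0x2019
```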
CodePudding user response:
It's a multi-step process, but it can all be combined on a single line:
>>> r'It\xe2\x80\x99s up to you. ¢ € £ ¥ ['.encode('utf-8').decode('unicode_escape').encode('latin-1').decode('utf-8')
'It’s up to you. ¢ € £ ¥ ['
The first encode converts all the Unicode characters to UTF-8 byte encoding. The decode works on bytes, and unicode_escape specifically converts the \xdd sequences to a single character. encode('latin-1') is a little trick to convert each Unicode character to the equivalent byte value. Finally decode('utf-8') converts all the bytes back to characters, whether they were generated by the first encode or the second one.
This only works if the encoded bytes are actually UTF-8, as they were in this case.
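If it helps to see the stages separately, here is the same chain unrolled into a function. This is only a sketch: the name unescape_utf8_bytes and the intermediate variable names are made up for illustration, and it assumes Python 3 and that the escaped bytes really are UTF-8.

```python
def unescape_utf8_bytes(text: str) -> str:
    """Turn literal \\xNN escapes that spell UTF-8 bytes into real characters."""
    # Step 1: real non-ASCII characters become their UTF-8 bytes;
    # the literal backslash sequences are plain ASCII and stay as they are.
    as_bytes = text.encode('utf-8')
    # Step 2: unicode_escape turns each \xNN into one character and maps
    # every other byte to the character with the same code point.
    one_char_per_byte = as_bytes.decode('unicode_escape')
    # Step 3: Latin-1 maps code points 0-255 straight back to byte values.
    utf8_bytes = one_char_per_byte.encode('latin-1')
    # Step 4: decode the combined byte stream as UTF-8.
    return utf8_bytes.decode('utf-8')


sample = r'It\xe2\x80\x99s up to you. ¢ € £ ¥ ['
print(unescape_utf8_bytes(sample))
```

Two things to keep in mind: unicode_escape also interprets other backslash escapes such as \n or \t if they occur in the text, and the function raises UnicodeDecodeError when the resulting bytes are not valid UTF-8.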