How to turn a mixed-formatted unicode string like "It\xe2\x80\x99s up to you. ¢ € £ ¥"

None of the answers I've come across seem to make sense, or at least I can't figure out how to search for this issue.

I have a unicode string like this one:

It\xe2\x80\x99s up to you. ¢ € £ ¥ [

As you can see, it contains a mix of \x-style escape sequences and characters that are already proper Unicode. How do I turn it all into proper Unicode characters (that is, turn the \x sequences into their native form)?

Telling me what the \x format is called might help!

CodePudding user response：

The \x prefix introduces a hexadecimal escape: the two hex digits that follow give the value of a single byte or character. This thread may help you with the conversion: How to change '\x??' to unicode in python?
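
As a rough illustration of what the \x notation means in Python (the variable names here are just for the example), compare an ordinary literal, a raw literal, and a bytes literal:

```python
escaped = '\xe2'   # ordinary literal: one character with code point 0xE2
raw = r'\xe2'      # raw literal: four characters (backslash, x, e, 2)
print(len(escaped), len(raw))

# As bytes, E2 80 99 is the UTF-8 encoding of U+2019,
# the right single quotation mark used in "It’s".
quote = b'\xe2\x80\x99'.decode('utf-8')
print(hex(ord(quote)))
```

The string in the question behaves like the raw literal: the backslashes are real characters in the text, which is why a conversion step is needed.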

CodePudding user response：

It's a multi-step process, but it can all be combined on a single line:

>>> r'It\xe2\x80\x99s up to you. ¢ € £ ¥ ['.encode('utf-8').decode('unicode_escape').encode('latin-1').decode('utf-8')
'It’s up to you. ¢ € £ ¥ ['

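The first encode converts the whole string to UTF-8 bytes: the literal \x sequences are plain ASCII and pass through unchanged, while ¢ € £ ¥ become multi-byte sequences. The decode works on bytes, and unicode_escape converts each \xdd sequence to a single character in the range U+0000 to U+00FF; it treats every other byte as Latin-1, so the UTF-8 bytes from the first step also become one character each. encode('latin-1') is a little trick to convert each of those characters back to the byte with the same value. Finally, decode('utf-8') converts all the bytes back to characters, whether they came from the \x escapes or from the first encode.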
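
If it helps, here is the same chain split into separate steps so each intermediate value can be inspected. This is only a sketch of the one-liner above; the names mixed, step1, step2, step3 and fixed are made up for the example.

```python
# Raw string, so the \x sequences are literal backslash characters,
# as in the question.
mixed = r'It\xe2\x80\x99s up to you. ¢ € £ ¥ ['

# 1. str -> bytes: the real Unicode characters become UTF-8 bytes,
#    the backslash sequences stay as ASCII text.
step1 = mixed.encode('utf-8')

# 2. bytes -> str: \xNN escapes and all other bytes each become one
#    character with a code point below 256.
step2 = step1.decode('unicode_escape')

# 3. str -> bytes: Latin-1 maps code points 0..255 to the same byte values.
step3 = step2.encode('latin-1')

# 4. bytes -> str: everything is now genuine UTF-8 and decodes cleanly.
fixed = step3.decode('utf-8')
print(fixed)
```

Printing step2 shows the temporary mojibake (characters such as â and Â) that step 3 turns back into bytes.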

This only works if the bytes written as \x escapes are actually UTF-8, as they were in this case.
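
To see that limitation in practice, here is a small sketch. The helper name unescape_utf8 is hypothetical, and the input r'caf\xe9' is an invented example in which the escape holds a Latin-1 byte rather than UTF-8:

```python
def unescape_utf8(text):
    """Hypothetical helper wrapping the one-liner from this answer."""
    return (text.encode('utf-8')
                .decode('unicode_escape')
                .encode('latin-1')
                .decode('utf-8'))

try:
    # 0xE9 is "é" in Latin-1, but on its own it is not valid UTF-8.
    unescape_utf8(r'caf\xe9')
    outcome = 'decoded'
except UnicodeDecodeError:
    outcome = 'not valid UTF-8'

print(outcome)
```

When that exception shows up, the escaped bytes were probably produced with some other encoding, and the final decode would need that encoding instead.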