How to turn a mixed-formatted unicode string like "It\xe2\x80\x99s up to you. ¢ € £ ¥"

Time:09-12

None of the answers I've come across seem to make sense, or at the least, I can't find out how to search for this issue.

I have a unicode string like this one:

It\xe2\x80\x99s up to you. ¢ € £ ¥ [

As you can see, it contains \x-style escape sequences mixed with characters that are already proper Unicode. How do I turn it all into proper Unicode characters, converting the \x sequences into the characters they represent?

Telling me what the \x format is called might help!

CodePudding user response:

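The \x prefix introduces a hexadecimal escape: the two hex digits after it give the value of a single byte. In your string, \xe2\x80\x99 is the three-byte UTF-8 encoding of ’ (U+2019, right single quotation mark). This thread may help you with the conversion: How to change '\x??' to unicode in python?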
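As a small illustration of what those escapes stand for, the sketch below writes the three escaped values from the question as a real bytes literal and decodes them as UTF-8. The variable names are only for illustration.

```python
# The three escaped values from the question, written as an actual bytes literal.
quote_bytes = b"\xe2\x80\x99"

# Decoding those three bytes as UTF-8 yields one character.
quote_char = quote_bytes.decode("utf-8")

print(len(quote_bytes))      # 3
print(hex(ord(quote_char)))  # 0x2019
```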

CodePudding user response:

It's a multi-step process, but it can all be combined on a single line:

>>> r'It\xe2\x80\x99s up to you. ¢ € £ ¥ ['.encode('utf-8').decode('unicode_escape').encode('latin-1').decode('utf-8')
'It’s up to you. ¢ € £ ¥ ['

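The first encode converts the whole string to UTF-8 bytes, so the characters that are already proper Unicode become multi-byte sequences while the literal \xhh text stays as plain ASCII. decode('unicode_escape') works on bytes: it turns each \xhh sequence into the single character with that code point and passes every other byte through as the Latin-1 character with the same value. encode('latin-1') is a little trick that maps each of those characters (all in the range U+0000 to U+00FF) back to the byte with the same value. Finally, decode('utf-8') converts all the bytes back to characters, whether they came from the first encode or from the \xhh escapes.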
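To make the four steps easier to follow, here is the same chain split into one statement per step. The function name unescape_mixed is hypothetical; the codec calls are the ones used in the one-liner.

```python
def unescape_mixed(text: str) -> str:
    """Turn literal \\xhh escapes holding UTF-8 bytes into real characters."""
    # Step 1: real Unicode characters become UTF-8 bytes; the literal
    # backslash-x text stays as ASCII.
    as_bytes = text.encode("utf-8")
    # Step 2: each \xhh escape becomes one character in U+0000..U+00FF;
    # all other bytes are read as Latin-1 characters.
    escaped_resolved = as_bytes.decode("unicode_escape")
    # Step 3: map every character back to the byte with the same value.
    raw_bytes = escaped_resolved.encode("latin-1")
    # Step 4: interpret the whole byte string as UTF-8.
    return raw_bytes.decode("utf-8")


raw = r"It\xe2\x80\x99s up to you. ¢ € £ ¥ ["
print(unescape_mixed(raw))
```

The raw string prefix (r"...") matters here: without it, Python would interpret the \x escapes itself when parsing the literal.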

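This only works if the escaped bytes are actually UTF-8, as they were in this case. If they are in some other encoding, the final decode('utf-8') typically fails with a UnicodeDecodeError or produces the wrong characters.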
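One possible way to guard against input that does not fit that assumption is to catch the Unicode errors the chain can raise. This is only a sketch: the helper name try_unescape and the choice to return None on failure are assumptions, not part of the answer above.

```python
from typing import Optional


def try_unescape(text: str) -> Optional[str]:
    """Apply the encode/decode chain; return None if the input does not fit."""
    try:
        return (
            text.encode("utf-8")
            .decode("unicode_escape")
            .encode("latin-1")
            .decode("utf-8")
        )
    except UnicodeError:
        # Covers both UnicodeEncodeError (a character above U+00FF reached
        # the latin-1 step) and UnicodeDecodeError (bytes are not valid UTF-8).
        return None


print(try_unescape(r"It\xe2\x80\x99s"))  # escaped bytes are valid UTF-8
print(try_unescape(r"caf\xe9"))          # \xe9 alone is Latin-1, not UTF-8: None
print(try_unescape(r"It\u2019s"))        # \u2019 cannot be encoded as latin-1: None
```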
