Two Unicode encodings represent 1 Cyrillic letter

Time:06-06

I have the following string, shown first as Unicode escapes and then as it is displayed:

\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083

and

ЕÑли повезет то ÑÐµÐ³Ð¾Ð´Ð½Ñ ÑƒÐ¶Ðµ Ñкину.

The desired output is "Если повезет то сегодня уже скину".

I have tried every encoding I could think of but still wasn't able to get it into fully Cyrillic form.

The best I got was

'�?�?ли повезе�? �?о �?егодн�? �?же �?кин�?'

using windows-1252.

I've also noticed that each Cyrillic letter in the desired string corresponds to two Unicode escapes.

For example: \u00d0\u0095 = 'Е'. Does anyone know which encoding is involved and how to use it to get a normal result?

CodePudding user response:

d0 95 d1 81 d0 bb d0 b8 is the correct UTF-8 octet stream for "Если".

So you need to convert each character to a byte (8-bit word, octet) by dropping the high-order part of its code point (which is always 00 in your example). Then decode the resulting bytes as UTF-8.

Or better, go back to the source from which you got this, and make sure the stream of octets is not interpreted as a single-byte encoding.
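The character-to-byte conversion described in this answer can be sketched in Python as follows. This is only a minimal illustration on the first four letters of the string; the variable names are made up for the example, and it assumes every character has a code point below 256 (otherwise `bytes()` raises `ValueError`).

```python
# First eight characters of the mis-decoded string ("Если" once repaired).
s = "\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8"

# Each code point is below 256, so it fits into exactly one byte.
raw = bytes(ord(ch) for ch in s)

# The resulting octets are the UTF-8 encoding of the intended text.
text = raw.decode("utf-8")
print(raw.hex(" "))
print(text)
```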

CodePudding user response:

You have a mis-decoded string where the UTF-8 bytes were decoded as ISO-8859-1 (also known as latin1). Ideally, re-download with the correct encoding, but you can also encode with the wrongly-used encoding to regain the original byte stream, then decode with the right encoding (UTF-8):

Python:

>>> s = '\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083'
>>> s
'Ð\x95Ñ\x81липовезеÑ\x82 Ñ\x82оÑ\x81егоднÑ\x8fÑ\x83жеÑ\x81кинÑ\x83'
>>> print(s)
ÐÑÐ»Ð¸Ð¿Ð¾Ð²ÐµÐ·ÐµÑ ÑоÑегоднÑÑжеÑкинÑ
>>> s.encode('latin1')
b'\xd0\x95\xd1\x81\xd0\xbb\xd0\xb8\xd0\xbf\xd0\xbe\xd0\xb2\xd0\xb5\xd0\xb7\xd0\xb5\xd1\x82 \xd1\x82\xd0\xbe\xd1\x81\xd0\xb5\xd0\xb3\xd0\xbe\xd0\xb4\xd0\xbd\xd1\x8f\xd1\x83\xd0\xb6\xd0\xb5\xd1\x81\xd0\xba\xd0\xb8\xd0\xbd\xd1\x83'
>>> s.encode('latin1').decode('utf8')
'Еслиповезет тосегодняужескину'
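This may also explain why the windows-1252 attempt in the question got close but not all the way. Python's `cp1252` codec has no mapping for C1 control characters such as U+0095 or U+0081, so the second half of some two-byte sequences is lost. The sketch below is one plausible way to reproduce that result, assuming the lossy `errors="replace"` handler was involved; the variable names are illustrative only.

```python
# First eight characters of the mis-decoded string from the question.
s = "\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8"

# A strict cp1252 encode fails: U+0095 and U+0081 are not in that code page.
try:
    s.encode("cp1252")
    cp1252_strict_ok = True
except UnicodeEncodeError:
    cp1252_strict_ok = False

# With errors="replace", unencodable characters become b"?", which breaks
# the UTF-8 sequences they belonged to.
via_cp1252 = s.encode("cp1252", errors="replace").decode("utf-8", errors="replace")

# latin1 maps every code point 0-255 to the same byte value, so nothing is lost.
via_latin1 = s.encode("latin1").decode("utf-8")

print(cp1252_strict_ok)
print(repr(via_cp1252))
print(via_latin1)
```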

You may also have a literal string of Unicode escape codes, which is a bit trickier:

>>> s=r'\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083'
>>> print(s)
\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083

In this case, the string has to be converted back to bytes, decoded as Unicode escapes, then encoded back to bytes and correctly decoded as UTF-8. latin1 has the feature that the first 256 code points of Unicode map to bytes 0-255 in that codec, so it converts 1:1 code point to byte value.

>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'Еслиповезет тосегодняужескину'
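Both cases can be wrapped in one small helper. This is a sketch rather than a general-purpose fix: `repair_mojibake` is a hypothetical name, the check for a literal `\u` is a simple heuristic, and the function assumes the input really is UTF-8 that was mis-decoded as latin1 (anything else raises `UnicodeEncodeError` or `UnicodeDecodeError`).

```python
def repair_mojibake(text: str) -> str:
    """Recover UTF-8 text that was mis-decoded as latin1.

    Hypothetical helper: also accepts the same data written as literal
    \\uXXXX escape sequences.
    """
    if "\\u" in text:
        # Literal backslash escapes: turn them into real characters first.
        text = text.encode("latin1").decode("unicode-escape")
    # latin1 maps code points 0-255 to the same byte values, which gives
    # back the original UTF-8 bytes.
    return text.encode("latin1").decode("utf-8")


# Mis-decoded characters (the escapes are interpreted by Python here).
print(repair_mojibake("\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8"))
# Literal escape text, as it might appear in a file or a JSON dump.
print(repair_mojibake(r"\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8"))
```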