Tensorflow unicode text encoding-decoding-CodePudding

I just started working with cyrillic text. Cannot properly print russian text after text preprocessing. How can I set encoding during text loading?

import pathlib
text = pathlib.Path('rus.txt').read_text(encoding='utf-8')

lines = text.splitlines()
pairs = [line.split('\t') for line in lines]
inp = [inp for targ, inp, tag in pairs]
targ = [targ for targ, inp, tag in pairs]
inp[:20]

Output1:

['Марш!',  'Иди.',  'Идите.',  'Здравствуйте.',  'Привет!',  'Хай.', 
   'Здрасте.',  'Здоро́во!',  'Приветик!',  'Беги!',  'Бегите!',...

Creating dataset:

BUFFER_SIZE = len (inp)
BATCH_SIZE = 64
    
dataset = tf.data.Dataset.from_tensor_slices((inp, targ)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE)

for example_input_batch, example_target_batch in dataset.take(1):
  print(example_input_batch[:5]) --Russian input
  print()
  print(example_target_batch[:5]) --English target
  break

Output2:

 tf.Tensor(
    [b'\xd0\xa2\xd0\xbe\xd0\xbc \xd0\xbf\xd0\xbe\xd1\x81\xd1\x82\xd1\x83\xd0\xbf\xd0\xb8\xd0\xbb \xd1\x85\xd0\xbe\xd1\x80\xd0\xbe\xd1\x88\xd0\xbe.'
     b'\xd0\xa2\xd1\x8b \xd1\x81\xd0\xb4\xd0\xb5\xd0\xbb\xd0\xb0\xd0\xbb\xd0\xb0 \xd1\x8d\xd1\x82\xd0\xbe \xd1\x81\xd0\xbf\xd0\xb5\xd1\x86\xd0\xb8\xd0\xb0\xd0\xbb\xd1\x8c\xd0\xbd\xd0\xbe.'
     b'\xd0\xa2\xd0\xbe\xd0\xbc \xd0\xb5\xd1\x89\xd1\x91 \xd0\xbd\xd0\xb5 \xd0\xbc\xd0\xbe\xd0\xb6\xd0\xb5\xd1\x82 \xd1\x85\xd0\xbe\xd0\xb4\xd0\xb8\xd1\x82\xd1\x8c \xd1\x81\xd0\xb0\xd0\xbc.'
     b'\xd0\x94\xd1\x83\xd0\xbc\xd0\xb0\xd1\x8e, \xd0\xbf\xd0\xbe\xd1\x80\xd0\xb0 \xd0\xbc\xd0\xbd\xd0\xb5 \xd0\xbf\xd0\xbe\xd0\xb3\xd0\xbe\xd0\xb2\xd0\xbe\xd1\x80\xd0\xb8\xd1\x82\xd1\x8c \xd0\xbe\xd0\xb1 \xd1\x8d\xd1\x82\xd0\xbe\xd0\xb9 \xd0\xbf\xd1\x80\xd0\xbe\xd0\xb1\xd0\xbb\xd0\xb5\xd0\xbc\xd0\xb5 \xd1\x81 \xd0\xbd\xd0\xb0\xd1\x87\xd0\xb0\xd0\xbb\xd1\x8c\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xbc.'
     b'\xd0\xaf \xd0\xbc\xd0\xbe\xd0\xb3\xd1\x83 \xd1\x8d\xd1\x82\xd0\xbe \xd1\x83\xd0\xbb\xd0\xb0\xd0\xb4\xd0\xb8\xd1\x82\xd1\x8c.'], shape=(5,), dtype=string)

tf.Tensor(
[b'Tom did a good thing.' b'You did that on purpose.'
 b"Tom can't walk on his own yet."
 b"I think it's time for me to talk to the boss about this problem."
 b'I can arrange that.'], shape=(5,), dtype=string)

Can you please advise what is the problem here with printing russian text? English text prints ok.

CodePudding user response：

The strings that appear like nonsense are actually UTF-8 encoded. See this post for more details.

For example, the first string in the tensor

\xd0\xa2\xd0\xbe\xd0\xbc \xd0\xbf\xd0\xbe\xd1\x81\xd1\x82\xd1\x83\xd0\xbf\xd0\xb8\xd0\xbb \xd1\x85\xd0\xbe\xd1\x80\xd0\xbe\xd1\x88\xd0\xbe.

is this garbage

Ð¢Ð¾Ð¼ Ð¿Ð¾ÑÑÑÐ¿Ð¸Ð» ÑÐ¾ÑÐ¾

which can actually be decoded correctly, like this:

s = '\xd0\xa2\xd0\xbe\xd0\xbc \xd0\xbf\xd0\xbe\xd1\x81\xd1\x82\xd1\x83\xd0\xbf\xd0\xb8\xd0\xbb \xd1\x85\xd0\xbe\xd1\x80\xd0\xbe\xd1\x88\xd0\xbe.'
decoded = bytes(s, encoding='latin').decode()
print(decoded)

Output:

Том поступил хорошо.

I'm not sure exactly how to do this with Tensorflow, but perhaps tf.strings.unicode_decode can help.

CodePudding user response：

I used tf.strings.unicode_decode() function which converts encoded '\xd0\xa2\xd0\xbe\xd0\xbc' like value into a tensors of integers like [1053, 1077, 32, 1076,... . I have also converted the result into numpy array to make it applicable for chr() function which converts the unicode integer into unicode symbols.

def decode_string(ints):
  strs = [chr(i) for i in ints]
  joined = [''.join(strs)]
  return joined

decoded = tf.strings.unicode_decode(example_input_batch[:5], 'utf-8').numpy()
decoded_list = [decode_string(ex) for ex in decoded]
print(decoded_list)

The result is:

[['Том был окружён дельфинами.'], ['Бразилия была колонией Португалии.'], ['Скажи Тому, чтобы поторопился.'], ['Я слишком многого прошу?'],...