Replace coded text by unicode text in Vietnamese

Time:06-28

I have a CSV file named sample.csv with the following contents:

No,duong
1, Ðu<U+1EDD>ng ÐT 605
2, Ðu<U+1EDD>ng Nam K<U+1EF3> Kh<U+1EDF>i Nghia
3, Ðu<U+1EDD>ng Duy Tân

I have another CSV file named viscii.csv that maps each code to its character:

key, value
<U+1EDD>,ờ
<U+1EF3>,ỳ
<U+1EDF>,ở

I ran the following code:

import pandas as pd
duong = pd.read_csv('sample.csv')
code = pd.read_csv('viscii.csv')
code_dict = dict((a, b) for a, b in zip(code['key'],code[' value']))
duong.replace(code_dict, regex = True)

The result is:

   No                                          duong
0   1                            Ðu<U+1EDD>ng ÐT 605
1   2   Ðu<U+1EDD>ng Nam K<U+1EF3> Kh<U+1EDF>i Nghia
2   3                           Ðu<U+1EDD>ng Duy Tân

This is not what I want. What I want is:

   No                     duong
0   1              Ðuờng ÐT 605
1   2   Ðuờng Nam Kỳ Khởi Nghia
2   3             Ðuờng Duy Tân

That is:

<U+1EDD> is replaced by "ờ"
<U+1EF3> is replaced by "ỳ"
<U+1EDF> is replaced by "ở"

Can you please tell me what went wrong with this decoding?

CodePudding user response:

Your own answer works for this specific case, but the key point is that you need to escape the regex pattern, because you don't want the regex engine to treat the + as a special character.

It works because enclosing a special character in [...] removes its special meaning: the regex engine reads [+] as a character class that matches any one of the characters inside the brackets, which here is only a literal +. A more general-purpose solution is to use the re.escape function to escape each key when you build the dictionary:

import pandas as pd
import re

duong = pd.read_csv('sample.csv')
code = pd.read_csv('viscii.csv')

code_dict = dict((re.escape(a), b) for a, b in zip(code['key'],code[' value']))

This creates a dictionary that looks like so:

{'<U\\+1EDD>': 'ờ', '<U\\+1EF3>': 'ỳ', '<U\\+1EDF>': 'ở'}

Notice the backslash before the + in each key: re.escape knows that a bare + is a special character, so it escaped it for us. Now, when you run

duong.replace(code_dict, regex=True)

you get:

   No                     duong
0   1              Ðuờng ÐT 605
1   2   Ðuờng Nam Kỳ Khởi Nghia
2   3             Ðuờng Duy Tân
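
The difference between the unescaped and escaped keys can be reproduced with the standard re module alone. The following is a minimal sketch rather than part of the original code; it assumes the keys are written in the <U+XXXX> form, and the names text and raw_key are illustrative only.

```python
import re

text = "Ðu<U+1EDD>ng ÐT 605"
raw_key = "<U+1EDD>"

# Unescaped: "U+" means "one or more U", so the pattern expects "1" right
# after the U and never matches the literal "+" in the text.
unescaped_match = re.search(raw_key, text)

# Escaped: re.escape turns "+" into "\+", which matches a literal plus sign.
escaped_key = re.escape(raw_key)
fixed = re.sub(escaped_key, "ờ", text)

# A character class also works: inside [...] the "+" loses its special meaning.
bracket_fixed = re.sub(raw_key.replace("+", "[+]"), "ờ", text)

print(unescaped_match)  # None
print(fixed)            # Ðuờng ÐT 605
print(bracket_fixed)
```

Both fixes produce the same string; re.escape is the safer default because it also handles any other special characters that might appear in a key.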

CodePudding user response:

I think I found my mistake.

The regex engine treated '+' as a quantifier instead of a literal character, so I replaced '+' with '[+]'.

The code is:

code = pd.read_csv('viscii.csv')
code['key'] = code['key'].str.replace('+', '[+]', regex=False)
code_dict = dict((a, b) for a, b in zip(code['key'], code[' value']))
duong.replace(code_dict, regex=True)

After that, the replacement worked as expected.
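
If every placeholder follows the same <U+XXXX> pattern, the lookup table may not be needed at all, since the hexadecimal code point can be converted directly with chr. The sketch below is an alternative, not the approach used in the answers here; decode_placeholders is a hypothetical helper name, and the pattern assumes 4 to 6 hexadecimal digits per placeholder.

```python
import re

# Matches placeholders such as <U+1EDD>; the "+" is escaped so it is literal.
PLACEHOLDER = re.compile(r"<U\+([0-9A-Fa-f]{4,6})>")


def decode_placeholders(text: str) -> str:
    """Replace each <U+XXXX> placeholder with the character at that code point."""
    return PLACEHOLDER.sub(lambda m: chr(int(m.group(1), 16)), text)


rows = [
    "Ðu<U+1EDD>ng ÐT 605",
    "Ðu<U+1EDD>ng Nam K<U+1EF3> Kh<U+1EDF>i Nghia",
    "Ðu<U+1EDD>ng Duy Tân",
]
decoded = [decode_placeholders(row) for row in rows]
for line in decoded:
    print(line)
```

With pandas, the same function could be applied to the column with duong['duong'].map(decode_placeholders), assuming the column holds only strings.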
