I have a CSV file named sample.csv with the following contents:
No,duong
1, Ðu<U+1EDD>ng ÐT 605
2, Ðu<U+1EDD>ng Nam K<U+1EF3> Kh<U+1EDF>i Nghia
3, Ðu<U 1EDD>ng Duy Tân
I have another CSV file named viscii.csv that maps each character code to its character:
key, value
<U+1EDD>,ờ
<U+1EF3>,ỳ
<U+1EDF>,ở
I run the following:
import pandas as pd
duong = pd.read_csv('sample.csv')
code = pd.read_csv('viscii.csv')
code_dict = dict((a, b) for a, b in zip(code['key'],code[' value']))
duong.replace(code_dict, regex = True)
The results are:
No duong
0 1 Ðu<U+1EDD>ng ÐT 605
1 2 Ðu<U+1EDD>ng Nam K<U+1EF3> Kh<U+1EDF>i Nghia
2 3 Ðu<U 1EDD>ng Duy Tân
This is not what I want. What I want is:
No duong
0 1 Ðuờng ÐT 605
1 2 Ðuờng Nam Kỳ Khởi Nghia
2 3 Ðuờng Duy Tân
That is:
<U+1EDD> is replaced by "ờ"
<U+1EF3> is replaced by "ỳ"
<U+1EDF> is replaced by "ở"
Can you please tell me what went wrong with this decoding?
CodePudding user response:
Your answer works in your specific case, but the key point is that the keys must be escaped before they are used as regex patterns, so that the + in a key such as <U+1EDD> is not treated as a special character (as a quantifier, + means "one or more of the preceding element").
Your answer works because enclosing a special character in [...] removes its special meaning: the regex engine reads [+] as "any one character from the set inside the brackets", which here is just a literal +. A more general-purpose solution is to call re.escape on each key when you build the dictionary:
import pandas as pd
import re
duong = pd.read_csv('sample.csv')
code = pd.read_csv('viscii.csv')
code_dict = dict((re.escape(a), b) for a, b in zip(code['key'],code[' value']))
This creates a dictionary that looks like this:
{'<U\\+1EDD>': 'ờ', '<U\\+1EF3>': 'ỳ', '<U\\+1EDF>': 'ở'}
Notice the backslash before the + in each key: re.escape knows that a bare + is a special character, so it escaped it for us. (The doubled backslash is just how Python displays a single backslash in a string.) Now, when you do
duong.replace(code_dict, regex=True)
you get:
No duong
0 1 Ðuờng ÐT 605
1 2 Ðuờng Nam Kỳ Khởi Nghia
2 3 Ðuờng Duy Tân
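For anyone who wants to try this without creating the two files, here is a minimal self-contained sketch of the same approach. It assumes the file contents shown in the question (with codes written as <U+1EDD> and so on) and feeds them to pandas through io.StringIO; the names sample_csv, viscii_csv and fixed are only illustrative.

```python
import io
import re

import pandas as pd

# In-memory stand-ins for sample.csv and viscii.csv (contents assumed from the question).
sample_csv = io.StringIO(
    "No,duong\n"
    "1, Ðu<U+1EDD>ng ÐT 605\n"
    "2, Ðu<U+1EDD>ng Nam K<U+1EF3> Kh<U+1EDF>i Nghia\n"
    "3, Ðu<U+1EDD>ng Duy Tân\n"
)
viscii_csv = io.StringIO(
    "key, value\n"
    "<U+1EDD>,ờ\n"
    "<U+1EF3>,ỳ\n"
    "<U+1EDF>,ở\n"
)

duong = pd.read_csv(sample_csv)
code = pd.read_csv(viscii_csv)

# The header is "key, value", so the second column name keeps its leading space.
# re.escape turns each key into a pattern that matches the key literally.
code_dict = {re.escape(k): v for k, v in zip(code["key"], code[" value"])}

# replace returns a new DataFrame; duong itself is left unchanged.
fixed = duong.replace(code_dict, regex=True)
print(fixed)
```

Note that replace does not modify duong in place, so the result has to be assigned to a variable (or assigned back to duong) to be kept.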
CodePudding user response:
I think I found my mistake.
The regex engine was tripping over the '+' in keys such as <U+1EDD>, so I replaced '+' with '[+]'.
The code is:
code = pd.read_csv('viscii.csv')
# Literal replacement (regex=False): a lone "+" is not a valid regex pattern.
code['key'] = code["key"].str.replace("+", "[+]", regex=False)
code_dict = dict((a, b) for a, b in zip(code['key'],code[' value']))
duong.replace(code_dict, regex = True)
After that, everything worked.
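To see why the unescaped key fails while both fixes work, here is a small sketch using only the standard re module. It assumes a key written as <U+1EDD>; the variable names are illustrative.

```python
import re

text = "Ðu<U+1EDD>ng"
raw_key = "<U+1EDD>"

# Unescaped, "U+" means "one or more U", so the pattern looks for
# "<U1EDD>", "<UU1EDD>", ... and never for a literal plus sign.
unescaped_match = re.search(raw_key, text)
matches_without_plus = re.search(raw_key, "<UU1EDD>") is not None

# Fix 1: a character class makes the plus literal.
bracket_result = re.sub("<U[+]1EDD>", "ờ", text)

# Fix 2: re.escape backslash-escapes every special character in the key.
escaped_result = re.sub(re.escape(raw_key), "ờ", text)

print(unescaped_match)       # None: the raw key does not match the text
print(matches_without_plus)  # True: it matches a string it was never meant to
print(bracket_result == escaped_result)  # True: both fixes give the same string
```

The bracket trick only handles the one character you wrap by hand, whereas re.escape covers any special character that might appear in a key.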