Need to translate and convert encoded values to readable English strings in Python

I have a dataframe like the one below, where the company names are in languages such as Japanese and Chinese:

import pandas as pd

data = [['company1', '<U 042E><U 043F><U 0438><U 0442><U 0435><U 0440>'],
        ['company2', '<c1>lom<e9>kszer Kft.'],
        ['company3', 'Ernst and young'],
        ['company4', '<c5>bo Akademi']]

df = pd.DataFrame(data, columns=['Name', 'company_name'])

Now all I want is to convert and translate these values into readable English values.

Can I do that? If yes, how, please?

CodePudding user response：

Your examples do not exhibit a single unified encoding. We can speculate that the two-digit ones are Latin-1, but I'm guessing (based also on the duplicate question) that the truth is really more complex than that.
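
To illustrate the Latin-1 guess, here is a minimal sketch. It assumes each two-digit marker is a single Latin-1 byte, which the question does not confirm; `latin1_char` is a hypothetical helper name introduced only for this example. It also shows why a plain `chr()` call gives the same result under that assumption.

```python
def latin1_char(hex_digits):
    """Decode a two-digit hex marker as a single Latin-1 byte (assumed encoding)."""
    return bytes([int(hex_digits, 16)]).decode("latin-1")


# Two-digit markers taken from the sample data in the question.
for hex_digits in ("c1", "e9", "c5"):
    decoded = latin1_char(hex_digits)
    # Latin-1 bytes map one-to-one onto the first 256 Unicode code points,
    # so chr() on the same number yields the same character.
    assert decoded == chr(int(hex_digits, 16))
    print(hex_digits, "->", decoded)
```

If the real source encoding turns out to be something else (for example a Windows code page), only the `decode` argument would need to change, and `chr()` would no longer be a safe shortcut.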

Anyway, for general direction at least, try this:

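import re
...
for index in range(len(data)):
    # Inner re.sub: two-digit markers such as <c1> (assumed to be Latin-1).
    # Outer re.sub: four-digit markers such as <U 042E>.
    data[index][1] = re.sub(
        r'<U\ ([0-9a-fA-F]{4})>',
        lambda x: chr(int(x.group(1), 16)),
        re.sub(
            r'<([0-9a-fA-F]{2})>',
            lambda x: chr(int(x.group(1), 16)),
            data[index][1]))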

Demo: https://ideone.com/X60x3Q

You can avoid the repeated lambda expression at the cost of a slightly more complex regular expression.

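for index in range(len(data)):
    # The lookbehinds tie each digit count to its prefix:
    # four digits only after "U ", two digits only directly after "<".
    data[index][1] = re.sub(
        r'<(?:U\ )?((?<=\ )[0-9a-fA-F]{4}|(?<=<)[0-9a-fA-F]{2})>',
        lambda x: chr(int(x.group(1), 16)),
        data[index][1])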

Demo: https://ideone.com/SkuvAJ
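
The same substitution can also be wrapped in a reusable function. The following is only a sketch: `MARKER` and `decode_markers` are hypothetical names, and the pattern is an alternative formulation with two capture groups rather than the lookbehind version above. It still assumes the two-digit markers are Latin-1.

```python
import re

# Matches either "<U XXXX>" (four hex digits, group 1)
# or "<XX>" (two hex digits, group 2).
MARKER = re.compile(r"<(?:U ([0-9a-fA-F]{4})|([0-9a-fA-F]{2}))>")


def decode_markers(text):
    """Replace each marker with the character whose code point it names."""
    def to_char(match):
        # Exactly one of the two groups participates in any given match.
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return MARKER.sub(to_char, text)


data = [['company1', '<U 042E><U 043F><U 0438><U 0442><U 0435><U 0440>'],
        ['company2', '<c1>lom<e9>kszer Kft.'],
        ['company3', 'Ernst and young'],
        ['company4', '<c5>bo Akademi']]

# Build a new list instead of mutating the original in place.
decoded = [[name, decode_markers(company)] for name, company in data]
```

With the dataframe from the question, something like `df['company_name'] = df['company_name'].map(decode_markers)` should apply the same function to the whole column. Note that this only decodes the characters; translating the resulting names into English is a separate step that this sketch does not attempt.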

CodePudding user response：

This needs some work. I just translated it manually. Here it is:

>>> '<U 042E><U 043F><U 0438><U 0442><U 0435><U 0440>'
'<U 042E><U 043F><U 0438><U 0442><U 0435><U 0440>' # not useful!
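
For comparison, a minimal sketch that decodes that string instead of echoing it, then looks up each character's Unicode name to see which script it belongs to. This is not part of the original answer; the variable names are illustrative, and it assumes each `<U XXXX>` marker holds a Unicode code point in hexadecimal.

```python
import re
import unicodedata

raw = '<U 042E><U 043F><U 0438><U 0442><U 0435><U 0440>'

# Turn each "<U XXXX>" marker into the character with that code point.
decoded = re.sub(r'<U ([0-9a-fA-F]{4})>',
                 lambda m: chr(int(m.group(1), 16)),
                 raw)

# Look up the official Unicode name of each character to identify the script.
names = [unicodedata.name(ch) for ch in decoded]

print(decoded)
for name in names:
    print(name)
```

Every name here starts with `CYRILLIC`, so this particular value is Cyrillic text rather than Japanese or Chinese, which is worth knowing before picking a translation step.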