Fix file encoded with the wrong charset

Time:04-29

I have a file that is encoded in us-ascii, as shown by the following command:

$ file -i /tmp/text
/tmp/text: text/plain; charset=us-ascii

But it contains many latin-1 encoded characters, for example:

Hij verblijft samen met zijn gezin in Belgi\xc3\xab
Activist Roger Espa\xc3\xb1ol raakte zijn oog kwijt door een politiekogel

I would like to replace these wrong characters with the correct ones.

What I tried:

$ iconv -f latin1 -t utf-8 text > text.1
with open("text") as f: text = f.read().encode("latin-1").decode("utf-8")
with open("text", "w") as f: f.write(text)
$ ftfy -e latin-1 text > text.1

and many variations of the above attempts. Any help is appreciated.

CodePudding user response:

Try this Python script:

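#!/usr/bin/env python3

import re

def convert(match):
    # Turn a literal "\xNN" escape (four ASCII characters) into the single byte 0xNN.
    return bytes([int(match.group(1), 16)])

# Read as bytes: the file is ASCII text containing literal backslash escapes.
with open("text", "rb") as f:
    # Match only a backslash, "x", and exactly two hex digits.
    text = re.sub(rb'\\x([0-9a-fA-F]{2})', convert, f.read())

# Write the bytes back unchanged; escapes such as \xc3\xab are now UTF-8 for "ë".
with open("text", "wb") as f:
    f.write(text)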
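
Why this works, as far as I can tell from the sample lines: the file does not contain latin-1 bytes at all. It contains the four ASCII characters backslash, x, and two hex digits, and those hex values look like UTF-8 byte sequences (c3 ab is "ë", c3 b1 is "ñ"). That would also explain why iconv and the encode/decode round trip changed nothing, since pure ASCII is identical in latin-1 and UTF-8. Below is a minimal in-memory sketch of the same idea; the helper name unescape_hex_bytes is made up for illustration, and it assumes every escape in the input is part of a valid UTF-8 sequence.

```python
import re

# Matches a literal backslash, "x", and exactly two hex digits.
HEX_ESCAPE = re.compile(rb'\\x([0-9a-fA-F]{2})')

def unescape_hex_bytes(raw: bytes) -> str:
    """Replace literal \\xNN escapes with the byte 0xNN, then decode as UTF-8.

    Hypothetical helper for illustration; assumes the escapes form valid UTF-8.
    """
    fixed = HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    return fixed.decode("utf-8")

# Raw bytes literal, so "\xc3" here is four ASCII characters, as in the file.
sample = rb"Hij verblijft samen met zijn gezin in Belgi\xc3\xab"
result = unescape_hex_bytes(sample)
print(result)
```

If some escapes in the real file are not valid UTF-8, decode() will raise UnicodeDecodeError; in that case it may be safer to inspect the offending lines than to pass errors="replace".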