I have a file that is encoded in us-ascii, as shown by the next command:
$ file -i /tmp/text
/tmp/text: text/plain; charset=us-ascii
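But it contains many non-ASCII characters written out as literal \xNN escape sequences (they look like the UTF-8 bytes of the intended characters rather than latin-1), for example: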
Hij verblijft samen met zijn gezin in Belgi\xc3\xab
Activist Roger Espa\xc3\xb1ol raakte zijn oog kwijt door een politiekogel
I would like to replace these escape sequences with the characters they stand for.
What I tried:
$ iconv -f latin1 -t utf-8 text > text.1
with open("text") as f: text = f.read().encode("latin-1").decode("utf-8")
with open("text", "w") as f: f.write(text)
$ ftfy -e latin-1 text > text.1
and many variations of the above, none of which changed the text. Any help is appreciated.
CodePudding user response:
Try this Python script:
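#!/usr/bin/env python3
import re

def convert(m):
    # Turn a literal "\xNN" escape (four ASCII characters) into the single byte 0xNN.
    return bytes([int(m.group(1), 16)])

with open("text", "rb") as f:
    # Match only a backslash, "x", and exactly two hex digits.
    text = re.sub(rb'\\x([0-9a-fA-F]{2})', convert, f.read())

# Overwrites the input file in place; the result is raw bytes (UTF-8 for the examples above).
with open("text", "wb") as f:
    f.write(text)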
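Why the earlier attempts did nothing: the file really is pure ASCII, because each \xc3\xab is stored as eight ordinary characters (backslash, x, c, 3, ...). ASCII bytes are identical in latin-1 and UTF-8, so iconv and an encode/decode round trip leave them untouched. Here is a minimal in-memory sketch of the same substitution, so you can check it on one of your sample lines before touching the file. The names unescape_hex and ESCAPE are just illustrative, and it assumes the escaped bytes are UTF-8, which holds for the two examples in the question.

```python
import re

# Matches a literal backslash, "x", and exactly two hex digits.
ESCAPE = re.compile(rb'\\x([0-9a-fA-F]{2})')

def unescape_hex(data: bytes) -> bytes:
    """Replace each literal \\xNN escape in data with the byte 0xNN."""
    return ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), data)

# Raw bytes literal: the backslashes are kept as-is, exactly as in the file.
sample = rb'Hij verblijft samen met zijn gezin in Belgi\xc3\xab'
fixed = unescape_hex(sample)

# Assumption: the recovered bytes are UTF-8.
print(fixed.decode('utf-8'))  # Hij verblijft samen met zijn gezin in België
```

If decode('utf-8') raises UnicodeDecodeError on your real data, some of the escapes are probably not UTF-8 and would need to be handled separately.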