Python not able to read "–" character from text file

Time:10-21

Using Python, I am fetching some text data from an API and storing it in a text file after some transformations and then reading this text file from a different process.

There are no problems while reading data from API, but I am getting this error while reading the text file:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 907: invalid start byte

The byte reported as 0x96 is actually the "–" (en dash) character in the API data. The error occurs only when the encoding argument is explicitly set to 'utf-8'; it does not occur when no encoding is passed to open() when opening the text file.

My questions:

  1. Why do we get this error only when the encoding is specified? I would expect the same error in the other case as well, since I believe the default encoding is also 'UTF-8'. (Please correct me if I am wrong.)
  2. Is it possible to resolve this issue without changing the way I am reading the text file? (That is, can I change the stage where I create this text file from the API data?)

Really appreciate you looking into it. Thanks!

CodePudding user response:

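In open(), the default encoding is platform dependent; the documentation for open() says so. You can find the default for your system by checking what locale.getpreferredencoding() returns. If that default is a Windows code page such as cp1252, byte 0x96 maps to the en dash "–", which would explain why the file reads without error when no encoding is given but fails as UTF-8.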
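As a minimal sketch of the point above, the snippet below prints the platform default and decodes the single byte 0x96 two ways. It assumes the file was written on a machine whose default is cp1252; the question does not state this, so treat it as a likely explanation rather than a confirmed one.

```python
import locale

# The encoding open() falls back to when none is given (platform dependent).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

raw = b"\x96"

# Assumption: the writer used cp1252, where byte 0x96 is the en dash U+2013.
as_cp1252 = raw.decode("cp1252")

# The same byte is not a valid UTF-8 start byte, which reproduces the error.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
```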

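For the second part of your question: since you do not get an error when you omit the encoding, you could pass the value returned by locale.getpreferredencoding() as the encoding explicitly. This only helps if the reading process runs on a platform with the same default as the one that wrote the file.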
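If you would rather keep the reader as it is and change the stage that creates the file, one option is to name the encoding explicitly when writing. The following is only a sketch: the file name and the sample text are hypothetical stand-ins for the API data.

```python
import os
import tempfile

text = "pages 10\u201320"  # sample text containing an en dash, like the API data

# Hypothetical output path, placed in a temporary directory for the example.
path = os.path.join(tempfile.mkdtemp(), "api_data.txt")

# Writer side: state the encoding instead of relying on the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reader side: unchanged, still opens the file as UTF-8.
with open(path, encoding="utf-8") as f:
    round_tripped = f.read()

# On disk the en dash is now the three-byte UTF-8 sequence E2 80 93.
with open(path, "rb") as f:
    raw_bytes = f.read()
```

With both sides naming the same encoding, the result no longer depends on which machine runs each process.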

CodePudding user response:

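If you are reading the file line by line, you could instead replace the character in each line, since 0x96 is treated as a non-printable character. This operates on the decoded string, so it only matches when byte 0x96 was decoded to the character U+0096 (for example with latin-1).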

import re
...
# Replace U+0096 with a plain hyphen-minus (0x2D); the replacement is a literal string.
line = re.sub(r'\x96', '-', line)
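Which character the substitution has to look for depends on how the bytes were decoded. As a small illustration with a made-up sample line, byte 0x96 becomes U+0096 under latin-1 but U+2013 under cp1252, so the search string differs between the two:

```python
raw = b"10\x9620"  # made-up sample line containing byte 0x96

as_latin1 = raw.decode("latin-1")  # byte 0x96 -> U+0096 (C1 control character)
as_cp1252 = raw.decode("cp1252")   # byte 0x96 -> U+2013 (en dash)

# str.replace is enough for a single fixed character; no regex needed.
fixed_latin1 = as_latin1.replace("\x96", "-")
fixed_cp1252 = as_cp1252.replace("\u2013", "-")
```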