Using Python, I am fetching some text data from an API, storing it in a text file after some transformations, and then reading this text file from a different process.
There are no problems while reading data from the API, but I get this error while reading the text file:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 907: invalid start byte
The byte being read as '0x96' is actually the "–" character in the API data, and this error occurs only when the encoding argument is explicitly specified as 'utf-8'. It doesn't occur when encoding is not explicitly passed to the open function while opening the text file.
My questions:
- Why do we get this error only when encoding is specified? I think we should get the same error in the other case as well, since the default encoding is also 'UTF-8'. (Please correct me if I am wrong.)
- Is it possible to resolve this issue without changing the way I am reading the text file? (i.e. can I make any changes at the stage where I am creating this text file from the API data?)
Really appreciate you looking into it. Thanks!
CodePudding user response:
In open(), the default encoding is platform dependent. You can find out the default for your system by checking what locale.getpreferredencoding() returns. This is stated in the documentation for open().
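To illustrate where the 0x96 byte may come from, here is a minimal sketch. It assumes the file was written as Windows-1252 (cp1252), a common platform default on Windows; that is a guess based on the byte value and is not confirmed in the question.

```python
import locale

# The encoding open() falls back to when none is passed (platform dependent).
print(locale.getpreferredencoding(False))

en_dash = "\u2013"  # the "–" character from the API data

# Assumption: the file was written as cp1252, where "–" is the single byte 0x96.
cp1252_bytes = en_dash.encode("cp1252")
# In UTF-8 the same character takes three bytes.
utf8_bytes = en_dash.encode("utf-8")

print(cp1252_bytes)  # b'\x96'
print(utf8_bytes)    # b'\xe2\x80\x93'

# Decoding the cp1252 byte as UTF-8 reproduces the error from the question.
try:
    cp1252_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print(exc)
```

If the first print shows something other than UTF-8 on your machine, that would explain why the error disappears when encoding is not passed.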
For the second part of your question: since you do not get an error when you do not specify utf-8 as the encoding, you could pass the value returned by locale.getpreferredencoding() as the encoding when reading the file.
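Alternatively, if the reading code has to stay as it is, the change could probably be made where the file is created. The sketch below assumes the writer currently calls open() without an encoding; the file path and sample text are made up for illustration.

```python
import os
import tempfile

text = "2010 \u2013 2020"  # sample line containing the en dash

# Hypothetical file path, used only for this example.
path = os.path.join(tempfile.mkdtemp(), "api_data.txt")

# Writer side: pass the encoding explicitly instead of relying on
# the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reader side: unchanged, still opens the file with encoding="utf-8".
with open(path, encoding="utf-8") as f:
    round_tripped = f.read()

print(round_tripped == text)
```

With both sides naming the same encoding, the result no longer depends on the platform each process runs on.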
CodePudding user response:
Since 0x96 is considered a "non-printable" character, you could replace it in each line of the text, if you are processing the file line by line.
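import re
...
# Replace the 0x96 character with a plain hyphen (0x2D).
# r'\x2D' is rejected as a bad escape in a re.sub replacement, so pass '-' directly.
line = re.sub(r'\x96', '-', line)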
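A variation on the same idea, as a rough sketch: decode the bytes first and then replace the en dash itself. It assumes the file is cp1252-encoded, which is a guess and not confirmed in the question; the sample bytes are made up.

```python
# Sample bytes standing in for one line of the file.
raw = b"2010 \x96 2020"

# Assumption: the bytes are cp1252, so 0x96 decodes to the en dash U+2013.
line = raw.decode("cp1252")

# Swap the en dash for a plain ASCII hyphen.
cleaned = line.replace("\u2013", "-")

print(cleaned)  # 2010 - 2020
```

Note that this changes the data (the en dash becomes a hyphen), so it only fits if that loss is acceptable.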