I am using pandas to read CSV data into my Python script. Both CSV files have the same encoding (Windows-1252). However, with one of the files I get an error when reading it with pandas unless I specify the encoding parameter in pd.read_csv().
Does anyone know why I need to specify the encoding for one CSV file and not the other? Both files contain similar data (strings and numbers).
Thank you
CodePudding user response:
That just means one of the files contains a byte outside the range 0x00 to 0x7F. Bytes in that range are plain ASCII and decode identically in Windows-1252 and in UTF-8; it's only the upper 128 values where the encoding makes a difference. All it takes is one n-with-tilde or one smart quote mark.
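One way to check this is to scan the raw bytes for anything above 0x7F. The following is only a rough sketch: the helper name `find_non_ascii` and the sample rows are made up for illustration and do not come from your files.

```python
def find_non_ascii(data: bytes):
    """Return (offset, byte value) pairs for every byte above 0x7F."""
    return [(i, b) for i, b in enumerate(data) if b > 0x7F]

# Hypothetical sample contents: the same row without and with an
# n-with-tilde (U+00F1), both encoded as Windows-1252.
plain = "id,name\n1,Nunez\n".encode("cp1252")
accented = "id,name\n1,Nu\u00f1ez\n".encode("cp1252")

plain_hits = find_non_ascii(plain)        # pure ASCII, nothing found
accented_hits = find_non_ascii(accented)  # one byte above 0x7F

for offset, value in accented_hits:
    print(f"offset {offset}: 0x{value:02X}")
```

To try it on a real file, read the file in binary mode (for example with `open(path, "rb").read()`, where `path` is whichever CSV is failing) and pass the bytes to the helper. The file that loads without an explicit encoding should produce an empty list.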
CodePudding user response:
pandas (at least as of version 1.3.3) decodes files as UTF-8 by default, even on Windows (see the source code). UTF-8 forbids some byte values outright (see the red cells in the code page layout), and any other byte at or above 0x80 is valid only as part of a well-formed multi-byte sequence. Windows-1252, by contrast, assigns a character to almost every byte value. So I suspect one of your files contains bytes that are legal in Windows-1252 but are not valid UTF-8. Perhaps a data entry error put a ø where a 0 was intended.
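The difference can be reproduced without pandas by decoding the same bytes both ways. This is a minimal sketch: the sample row and the helper `try_decode` are invented for illustration, and it assumes the stray character really is a ø, which Windows-1252 stores as the single byte 0xF8.

```python
# Hypothetical CSV contents with a ø (U+00F8) where a 0 was intended,
# encoded as Windows-1252, so the ø becomes the single byte 0xF8.
raw = "id,value\n1,\u00f8\n".encode("cp1252")

def try_decode(data: bytes, encoding: str) -> str:
    """Decode data, or report where decoding failed."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"failed at byte offset {exc.start}"

utf8_result = try_decode(raw, "utf-8")     # 0xF8 can never appear in UTF-8
cp1252_result = try_decode(raw, "cp1252")  # decodes cleanly

print(utf8_result)
print(repr(cp1252_result))
```

This mirrors what happens in pd.read_csv(): with the default UTF-8 decoding the file with the stray byte raises an error, while passing `encoding="cp1252"` lets it through. The reported offset can help locate the offending character in the file.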