I am using a YouTube channel to learn machine learning algorithms. Somewhere in this video, I encountered an argument passed to the pd.read_csv method: encoding='latin-1'. I searched the net but found nothing. I would like to know what this argument does.
Thank you for your time.
CodePudding user response:
https://en.wikipedia.org/wiki/ISO/IEC_8859-1
Latin-1 is the same as ISO 8859-1. Every character is encoded as a single byte. It defines 191 printable characters in total.
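To make the "single byte" point concrete, here is a minimal Python sketch. The sample word is only an assumed example; it compares how many bytes the same text takes in Latin-1 and in UTF-8.

```python
# Hypothetical sample text containing one accented character.
text = "café"

# Latin-1 stores every character it supports in exactly one byte.
latin1_bytes = text.encode("latin-1")

# UTF-8 needs two bytes for "é", so the result is one byte longer.
utf8_bytes = text.encode("utf-8")

print(len(text), len(latin1_bytes), len(utf8_bytes))
print(latin1_bytes, utf8_bytes)
```

In Latin-1 the number of bytes always equals the number of characters, which is not true in general for UTF-8.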
CodePudding user response:
Here is the underlying reason for the "encoding=" parameter.
English speakers live in an easy world where the number of characters needed to write any kind of text or computer code is small enough to be stored in an 8-bit byte (even in 7 bits, by the way, but that's not the point). Therefore, 1 character = 1 byte, and everybody agrees on the meaning of each of the 128 ASCII values (0 to 127).
Many other languages, even those that use the same Latin alphabet, need all kinds of accented letters and special characters that do not exist in English. In addition, the special characters of all those languages together don't fit into 256 different byte values. Historically, every language community decided on its own specific encoding for the byte values above 127. latin-1, aka iso-8859-1, is one of those encodings, but as you may guess, not the only one. This doesn't scale well, of course, and won't work for languages that don't use the Latin alphabet and need far more than 256 different values.
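A small Python sketch of what "a specific encoding for byte values above 127" means in practice: the same byte becomes a different character depending on which single-byte encoding is assumed. The byte 0xE9 and the three codecs are just illustrative choices.

```python
# One raw byte with a value above 127.
raw = b"\xe9"

# The same byte, interpreted under three different single-byte encodings.
as_latin1 = raw.decode("latin-1")      # Western European
as_greek = raw.decode("iso8859_7")     # Greek
as_cyrillic = raw.decode("cp1251")     # Windows Cyrillic

print(as_latin1, as_greek, as_cyrillic)

# Bytes below 128 are plain ASCII and mean the same thing in all three.
assert b"A".decode("latin-1") == b"A".decode("iso8859_7") == b"A".decode("cp1251")
```

None of the three results is "wrong" from the computer's point of view, which is why the encoding has to be stated by whoever knows how the file was written.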
In all modern programming languages, a character and a byte are two different things. (Read this sentence twice or more, and commit it to permanent memory.)
The computer cannot reliably "guess" the encoding of a byte stream (like a csv file) that you feed it for processing as text (= strings of characters). Therefore, a function that reads files (I didn't watch the video, but the name of the function is explicit enough to understand its purpose) has to convert bytes (on the disk) into characters (in memory, using whatever internal representation the language happens to use). Conversely, when you have to write something to the disk or the network, which can only accept 8-bit bytes, you have to convert your characters back into bytes.
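The bytes-to-characters step described above can be sketched with Python's built-in open(). The file content and the temporary file are assumptions made up for the illustration; the point is that the same bytes read fine with the right encoding and fail with the wrong one.

```python
import os
import tempfile

# Write raw Latin-1 bytes to a temporary file (hypothetical sample content).
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write("name,city\nRenée,Orléans\n".encode("latin-1"))
    path = tmp.name

# Reading with the encoding the file was written in gives the expected text.
with open(path, encoding="latin-1") as f:
    decoded = f.read()

# Reading the same bytes as UTF-8 fails, because 0xE9 on its own
# is not a valid UTF-8 sequence.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

os.remove(path)
print(decoded, utf8_failed)
```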
Those conversions are performed using the particular encoding your file/byte stream/network protocol is using.
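Applied to the question, this is roughly what encoding='latin-1' does in pd.read_csv. The sketch below assumes pandas is installed and uses an in-memory buffer with made-up data instead of the file from the video.

```python
import io

import pandas as pd

# Made-up CSV content, stored as Latin-1 bytes as it would be on disk.
raw = "name,city\nRenée,Orléans\n".encode("latin-1")

# encoding= tells pandas how to turn those bytes into characters.
df = pd.read_csv(io.BytesIO(raw), encoding="latin-1")

print(df)
```

Without the argument, pandas assumes UTF-8, which would not be able to decode these particular bytes.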
As a side note, you should consider moving away from the 8859-* encodings and using Unicode, with the UTF-8 encoding, as much as possible in new developments.
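If you do have legacy Latin-1 data, converting it is a decode followed by an encode. A minimal sketch, assuming the data really is Latin-1 (the sample bytes are invented):

```python
# Invented legacy data, stored as Latin-1 bytes.
latin1_data = "Orléans;Málaga\n".encode("latin-1")

# Step 1: bytes -> characters, using the encoding the data was written in.
text = latin1_data.decode("latin-1")

# Step 2: characters -> bytes, using the encoding you want from now on.
utf8_data = text.encode("utf-8")

print(latin1_data, utf8_data)
```

One caveat: Python's latin-1 codec accepts every possible byte value, so decoding as Latin-1 never raises an error. If the data was actually in another encoding, the conversion will succeed silently and produce garbled characters.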