Home > Mobile >  Pandas read_csv - Error tokenizing data after modifying Excel .csv file
Pandas read_csv - Error tokenizing data after modifying Excel .csv file

Time:11-21

I have a CSV dataset for an ML classifier. It has 2 columns and looks like this:

enter image description here

But this dataset is very dirty, so I decided to open it with Excel, remove "dirty" words, and save it as a new CSV file and train my ML classifier on it.

But after I saved it in Excel (using , separator and also tried , UTF-8), and when trying pd.read_csv on it, it gives me this error:

Error tokenizing data. C error: Expected 3 fields in line 4, saw 5

Then I tried to use sep=';' with read_csv, and it worked, but now all Russian characters are replaced with strange symbols:

enter image description here

Can somebody explain please how to repair "question"-symbols from Russian characters? encoding='UTF-8' gives this error:

'utf-8' codec can't decode byte 0xe6 in position 22: invalid continuation byte

This is what the first file looks like (not modified Excel .csv file):

enter image description here

When I open second file (modified):

enter image description here

CodePudding user response:

Try opening the file with either ptcp154 or kz1048 encodings. They seem to work.

  • Related