I was wondering why I get funny behaviour using a csv file that has been "changed" in excel.
I have a csv file of around 211,029 rows and pass this csv file into pandas using a Jupyter-notebook
The simplest example I can give of a change is simply clicking on the filter icon in excel saving the file, unclicking the filter icon and saving again (making no physical changes in the data).
When I pass my csv file through pandas, after a few filter operations, some rows go missing.
This is in comparison to that of doing absolutely nothing with the csv file. Leaving the csv file completely alone gives me the correct number of rows I need after filtering compared to "making changes" to the csv file.
Why is this? Is it because of the number of rows in a csv file? Are we supposed to leave csv files untouched if we are planning to filter through pandas anyways?
(As a side note I'm using Excel on a MacBook.)
CodePudding user response:
Excel does not leave any file "untouched". It applies formatting to every file it opens (e.g. float values like "5.06" will be interpreted as date and changed to "05 Jun"). Depending on the expected datatype these rows might be displayed wrongly or missing in your notebook. Better use sed or awk to manipulate csv files (or a text editor for smaller files).