I am trying to import a very large .csv file as:
import dask.dataframe as dd
import pandas as pd
dd_subf1_small = dd.read_csv('subf1_small.csv', dtype={'Unnamed: 0': 'float64','oecd_subfield':'object','paperid':'object'}, sep=None, engine = 'python').persist()
but I am getting the following error:
ParserError Traceback (most recent call last)
ParserError: Expected 3 fields in line 1811036, saw 5
I don't actually know how the data is structured, because the CSV file is 36 GB and I could not open it. I saw another question where the error was caused by passing header=None, which I am not doing.
How can I avoid the above error?
CodePudding user response:
As the error says, your CSV file probably contains rows with 5 values instead of 3.
You have two options:
- Find those rows and fix or remove them from the file. This might be challenging given how large the file is.
- Use the parameter on_bad_lines='skip' to let pandas skip them and continue loading the file.
Learn more about on_bad_lines here: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
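As a rough illustration of what on_bad_lines does, here is a minimal sketch on a tiny in-memory stand-in for the file. The sample rows are made up, and a comma delimiter is assumed because the real one is unknown. on_bad_lines needs pandas 1.3 or newer. dask.dataframe.read_csv forwards extra keyword arguments to pandas.read_csv, so the same argument should be usable in the original call.

```python
import io

import pandas as pd

# Hypothetical sample: the third data row has 5 fields instead of 3,
# which mimics the "Expected 3 fields ..., saw 5" situation.
sample = io.StringIO(
    ",oecd_subfield,paperid\n"
    "0,1.01,A1\n"
    "1,1.02,A2\n"
    "2,1.03,A3,extra,extra\n"
    "3,1.04,A4\n"
)

df = pd.read_csv(
    sample,
    sep=",",  # assumption: the real file may use another delimiter
    dtype={"Unnamed: 0": "float64", "oecd_subfield": "object", "paperid": "object"},
    on_bad_lines="skip",  # drop rows with too many fields instead of raising
)

print(df)
```

Note that skipped rows are dropped silently, so it is worth comparing row counts afterwards if the lost rows matter.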
Also, I noticed you are using sep=None. Why? With engine='python' this does not mean "no separator"; it tells pandas to guess the delimiter, and a wrong guess could also produce mismatched field counts. The default (and most common) delimiter (aka separator) is a comma (,). Post an example of 3 lines from the file here so I can assist with that.
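Since the file is too large to open in an editor, a few raw lines can be read without loading the rest. The sketch below uses only the standard library. peek_lines is a hypothetical helper name and the demo file is made up. With dask, the line number in the error may be relative to one partition rather than the whole file, so treat it as a hint.

```python
import itertools
import os
import tempfile


def peek_lines(path, start=1, count=3, encoding="utf-8"):
    """Return `count` raw lines starting at 1-based line `start`.

    The file is streamed line by line, so memory use stays small
    even for very large files.
    """
    with open(path, "r", encoding=encoding, errors="replace") as fh:
        wanted = itertools.islice(fh, start - 1, start - 1 + count)
        return [line.rstrip("\n") for line in wanted]


# Demo on a small temporary file standing in for the real CSV.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write(
            ",oecd_subfield,paperid\n"
            "0,1.01,A1\n"
            "1,1.02,A2\n"
            "2,1.03,A3,extra,extra\n"
        )
    head = peek_lines(demo_path, start=1, count=3)     # header plus two rows
    suspect = peek_lines(demo_path, start=4, count=1)  # the malformed row

print(head)
print(suspect)
```

For the real file, something like peek_lines('subf1_small.csv', start=1, count=3) would give the three lines asked for above, and the delimiter can be read off them directly.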