I have a csv with multiple lines that produce the following error:
df1 = pd.read_csv('df1.csv', skiprows=[176, 2009, 2483, 3432, 7486, 7608, 7990, 11992, 12421])
ParserError: Error tokenizing data. C error: Expected 52 fields in line 12541, saw 501
As you can probably notice, I have multiple lines that produce a ParserError.
To work around this, I am just updating 'skiprows' to include the error and continue parsing the csv. I have over 30K lines and would prefer to just do this all at once rather than hitting run in Jupyter Notebook, getting a new error, and updating. Otherwise, I wish it would just skip the errors and parse the rest, I've tried googling a solution that way - but all the SO responses were too complicated for me to follow and reproduce for my data structures.
P.S. why is that when using skiprows with just 1 line, like 177, I can just enter skiprows = 177, but when using skiprows with a list, I have to do skiprows = 'errored line - 1'? Why does the counting change?
CodePudding user response:
pandas ≥ 1.3
You should use the on_bad_lines
parameter of read_csv
(pandas ≥ 1.3.0)
df1 = pd.read_csv('df1.csv', on_bad_lines='warn')
This will skip the invalid lines and give you a warning. If you use on_bad_lines='skip'
you skip the lines without warning. The default value of on_bad_lines='error'
raises an error for the first issue and aborts.
pandas < 1.3
The parameters are error_bad_lines=False
and warn_bad_lines=True
.