List all pandas ParserErrors when using pd.read_csv

I have a CSV file in which multiple lines produce the following error:

df1 = pd.read_csv('df1.csv',  skiprows=[176, 2009, 2483, 3432, 7486, 7608, 7990, 11992, 12421])

ParserError: Error tokenizing data. C error: Expected 52 fields in line 12541, saw 501

As the skiprows list shows, several lines have already raised a ParserError.

To work around this, I add each failing line to skiprows and parse the CSV again. The file has over 30K lines, so I would prefer to find all the bad lines at once rather than running the cell in Jupyter Notebook, getting a new error, and updating the list. Alternatively, I would be happy for pandas to skip the bad lines and parse the rest. I have searched for a solution along those lines, but the SO answers I found were too complicated for me to follow and adapt to my data.

P.S. Why is it that when I skip a single line, such as 177, I can just enter skiprows=177, but when I pass a list I have to enter the errored line number minus 1? Why does the counting change?

Answer:

pandas ≥ 1.3

Use the on_bad_lines parameter of read_csv:

df1 = pd.read_csv('df1.csv', on_bad_lines='warn')

With on_bad_lines='warn', pandas skips the invalid lines and emits a warning about them. With on_bad_lines='skip', it skips them silently. The default, on_bad_lines='error', raises an exception at the first bad line and aborts.
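As a rough illustration, the sketch below runs read_csv on a small in-memory CSV whose third data row has one field too many. The sample data and the record_bad_line helper are made up for this example. The second call assumes pandas ≥ 1.4, where on_bad_lines also accepts a callable when engine='python' is used, and it collects the offending rows instead of just dropping them.

```python
import io

import pandas as pd

# Hypothetical sample data: the row "7,8,9,10" has 4 fields instead of 3.
raw = "a,b,c\n1,2,3\n4,5,6\n7,8,9,10\n11,12,13\n"

# pandas >= 1.3: drop malformed rows silently.
df = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(df)

# pandas >= 1.4 with the python engine: collect the malformed rows.
bad_rows = []


def record_bad_line(fields):
    """Hypothetical helper: remember the bad row, then drop it."""
    bad_rows.append(fields)
    return None  # returning None tells pandas to skip this row


df2 = pd.read_csv(
    io.StringIO(raw),
    engine="python",
    on_bad_lines=record_bad_line,
)
print(bad_rows)
```

The callable receives the already split fields of each bad row, not its line number, so this lists what was dropped rather than where it was in the file.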

pandas < 1.3

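The equivalent parameters are error_bad_lines=False (skip bad lines instead of raising) and warn_bad_lines=True (output a warning for each skipped line). Both were deprecated in pandas 1.3 and removed in pandas 2.0.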
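If the goal is to list every offending line up front, whatever the pandas version, one option is to pre-scan the file with the standard-library csv module. This is a different technique from the parameters above, sketched under assumptions: the find_bad_lines helper and the sample data are hypothetical, the first non-empty row is taken as the header that defines the expected field count, and only rows with too many fields are reported, since those are what trigger "Expected N fields, saw M".

```python
import csv
import io


def find_bad_lines(handle):
    """Return (line_number, field_count) for rows with more fields than the header.

    Line numbers are 1-based, like those in the ParserError message.
    """
    expected = None
    bad = []
    reader = csv.reader(handle)
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if expected is None:
            expected = len(row)  # assumption: the first row is the header
        elif len(row) > expected:
            bad.append((reader.line_num, len(row)))
    return bad


# Hypothetical sample data: lines 4 and 6 have too many fields.
raw = "a,b,c\n1,2,3\n4,5,6\n7,8,9,10\n11,12,13\n12,13,14,15,16\n"
bad = find_bad_lines(io.StringIO(raw))
print(bad)

# A list passed to skiprows holds 0-indexed line numbers, hence the "- 1".
skiprows = [line_number - 1 for line_number, _ in bad]
print(skiprows)
```

For a real file, open it with open('df1.csv', newline='') and pass the handle to the helper. The resulting list can then go to pd.read_csv('df1.csv', skiprows=skiprows). This also bears on the P.S.: an integer skiprows is the number of lines to skip at the start of the file, while a list holds 0-indexed line numbers.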