I'm loading CSV files into a dataframe and then into a database, but a few of the files have rows with data in extra columns. I want to drop any row that exceeds 'X' number of columns.
For example, I want to remove rows that have more than 4 columns:
id, header1, header2, header3
1, desc1, desc2, desc3
2, desc1, desc2, desc3
3, desc1, desc2, desc3, desc4
4, desc1, desc2, desc3
I would like to skip the row with id = 3, giving this output:
id, header1, header2, header3
1, desc1, desc2, desc3
2, desc1, desc2, desc3
4, desc1, desc2, desc3
CodePudding user response:
Pandas can do this for you: pass on_bad_lines='skip' as a parameter of read_csv:
df = pd.read_csv('data.csv', on_bad_lines='skip')
print(df)
# Output
id header1 header2 header3
0 1 desc1 desc2 desc3
1 2 desc1 desc2 desc3
2 4 desc1 desc2 desc3
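If you need finer control than a blanket skip, e.g. to record which rows were dropped before loading into the database, pandas 1.4+ also accepts a callable for on_bad_lines (it requires engine='python'). A minimal sketch, assuming in-memory data matching the example; the handle function name and bad_rows list are illustrative, not part of the pandas API:

```python
import io
import pandas as pd

data = """id,header1,header2,header3
1,desc1,desc2,desc3
2,desc1,desc2,desc3
3,desc1,desc2,desc3,desc4
4,desc1,desc2,desc3
"""

bad_rows = []

def handle(row):
    bad_rows.append(row)  # keep the offending fields for inspection/logging
    return None           # returning None drops the row entirely

df = pd.read_csv(
    io.StringIO(data),
    engine="python",      # the callable form of on_bad_lines needs this engine
    on_bad_lines=handle,
)
print(df)        # rows with ids 1, 2, 4
print(bad_rows)  # [['3', 'desc1', 'desc2', 'desc3', 'desc4']]
```

Returning a trimmed list from the callable instead of None would keep the row with the extra fields discarded, if that better fits your pipeline.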
Update
For pandas versions before 1.3 (where on_bad_lines does not exist), use the since-deprecated error_bad_lines=False instead:
df = pd.read_csv('data.csv', error_bad_lines=False)
print(df)
# Output is the same as above
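If you would rather pre-filter the files yourself, independent of the pandas version, the standard-library csv module can drop long rows before they ever reach read_csv. A minimal stdlib-only sketch, assuming a threshold of 4 columns to match the example (MAX_COLS is an illustrative name):

```python
import csv
import io

MAX_COLS = 4  # drop any row with more than this many fields

raw = """id,header1,header2,header3
1,desc1,desc2,desc3
2,desc1,desc2,desc3
3,desc1,desc2,desc3,desc4
4,desc1,desc2,desc3
"""

# Keep only rows within the column limit; the row with id 3 is filtered out
kept = [row for row in csv.reader(io.StringIO(raw)) if len(row) <= MAX_COLS]
print(kept)
```

You could then feed the filtered rows to pd.DataFrame(kept[1:], columns=kept[0]) or write them back out to a cleaned CSV before loading.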