I have a CSV file that I am parsing with Python. I found that some rows in the file have a different number of columns.
001;Snow,Jon;19801201
002;Crom,Jake;19920103
003; ;Wise,Frank;19880303 <-- Invalid row
004;Wiseau,Tommy;4324;1323;2323 <-- Invalid row
I would like to write these invalid rows into a separate text file.
I used this line of code to read from the file.
df = pd.read_csv('names.csv', header=None, sep=';')
One solution I found here was to skip the problematic rows using the following code:
data = pd.read_csv('file1.csv', on_bad_lines='skip')
I could change 'skip' to 'warn', which reports the line number of each problematic row and then skips it. But that only produces warning messages, not the rows themselves.
CodePudding user response:
You could split the CSV file with a script that you run before loading it into pandas. For example:
with open('names.csv') as src, open('good.csv', 'w') as good, open('bad.csv', 'w') as bad:
    for line in src:
        if line.count(';') == 2:  # or any other appropriate criteria
            good.write(line)
        else:
            bad.write(line)
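If you need this more than once, the same loop can be wrapped in a function. This is only a sketch: `split_rows` and its `expected_fields` parameter are names I made up, and it assumes a row is valid when it has exactly three `;`-separated fields and that blank lines can be ignored.

```python
def split_rows(src_path, good_path, bad_path, sep=";", expected_fields=3):
    """Copy each line of src_path into good_path or bad_path by field count.

    Hypothetical helper: returns (good_count, bad_count).
    """
    good_count = bad_count = 0
    with open(src_path) as src, \
            open(good_path, "w") as good, \
            open(bad_path, "w") as bad:
        for line in src:
            if not line.strip():
                continue  # assumption: blank lines are neither good nor bad
            # N fields means N - 1 separators
            if line.count(sep) == expected_fields - 1:
                good.write(line)
                good_count += 1
            else:
                bad.write(line)
                bad_count += 1
    return good_count, bad_count
```

Calling `split_rows('names.csv', 'good.csv', 'bad.txt')` and then `pd.read_csv('good.csv', header=None, sep=';')` should load only the well-formed rows, while `bad.txt` holds the rest. Note that counting separators does not account for quoted fields that contain `;`.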
CodePudding user response:
I would suggest reading the file with open() and parsing it afterwards:
import pandas as pd

with open('test.csv', 'r') as f:
    text = f.read().splitlines()  # unlike readlines(), this drops the trailing '\n'
df = pd.DataFrame(text, columns=['text'])
df = df.text.str.split(';', expand=True)  # short rows are padded with None
df.to_csv('fix_me.csv', index=False)  # you can either filter out the bad rows by the last two columns, or save as is
     0             1           2         3     4
0  001      Snow,Jon    19801201      None  None
1  002     Crom,Jake    19920103      None  None
2  003                Wise,Frank  19880303  None
3  004  Wiseau,Tommy        4324      1323  2323
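To actually separate the bad rows after the split, one option is to count the filled cells per row. A rough sketch, where `split_good_bad` and `expected_fields` are hypothetical names of mine and a row counts as valid only if it has exactly three fields:

```python
import pandas as pd


def split_good_bad(lines, sep=";", expected_fields=3):
    """Return (good_df, bad_lines) for an iterable of raw text lines.

    Hypothetical helper, not part of pandas.
    """
    raw = pd.Series([line.rstrip("\n") for line in lines], dtype="object")
    # One column per field; rows with fewer fields are padded with missing values
    parts = raw.str.split(sep, expand=True)
    # A row is valid when exactly `expected_fields` cells are filled
    is_good = parts.notna().sum(axis=1) == expected_fields
    good_df = parts[is_good].iloc[:, :expected_fields].reset_index(drop=True)
    # Keep the bad rows as the original, unsplit text
    bad_lines = raw[~is_good].tolist()
    return good_df, bad_lines
```

With the four sample rows from the question, `good_df` should contain rows 001 and 002, and `bad_lines` the two invalid rows as plain strings, which you can then write to a text file with an ordinary `open(..., 'w')`.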