Pandas - How to Write Invalid Rows to a Text File?

I have a CSV file that I am parsing with Python. I found that some rows in the file have a different number of columns from the rest.

001;Snow,Jon;19801201
002;Crom,Jake;19920103
003; ;Wise,Frank;19880303   <-- Invalid row
004;Wiseau,Tommy;4324;1323;2323  <-- Invalid row

I would like to write these invalid rows into a separate text file.

I used this line of code to read from the file.

df = pd.read_csv('names.csv', header=None,sep=';')

One solution I found here was to skip the problematic rows using the following code:

data = pd.read_csv('file1.csv', on_bad_lines='skip')

I could change 'skip' to 'warn', which reports the line number of each problematic row and still skips it. But that only produces warning messages, not the rows themselves.
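
For reference, pandas 1.4 and later also accept a callable for on_bad_lines when engine='python' is used. The sketch below is one possible way to capture the rejected rows with it, not something tried in this thread. The names collect_bad_line, bad_rows, and bad_rows.txt are made up for illustration, and io.StringIO stands in for names.csv so the snippet runs on its own.

```python
import io
import pandas as pd

# Sample data copied from the question; StringIO stands in for names.csv.
sample = io.StringIO(
    "001;Snow,Jon;19801201\n"
    "002;Crom,Jake;19920103\n"
    "003; ;Wise,Frank;19880303\n"
    "004;Wiseau,Tommy;4324;1323;2323\n"
)

bad_rows = []  # hypothetical collector for the rejected rows


def collect_bad_line(fields):
    # pandas passes the offending row as a list of already-split strings.
    bad_rows.append(fields)
    return None  # returning None drops the row from the DataFrame


df = pd.read_csv(
    sample,
    header=None,
    sep=";",
    engine="python",  # callables are only supported by the python engine
    on_bad_lines=collect_bad_line,
)

# Re-join the fields with the original separator, one row per line.
with open("bad_rows.txt", "w") as out:  # hypothetical output file name
    for fields in bad_rows:
        out.write(";".join(fields) + "\n")
```

Two caveats: the callable is only invoked for rows with too many fields (rows with too few are padded with NaN instead), and re-joining the fields does not restore any quoting the original line may have had.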

CodePudding user response：

You could split the CSV file with a script that you run before loading it into pandas, such as:

with open('names.csv') as src, open('good.csv', 'w') as good, open('bad.csv', 'w') as bad:
    for line in src:
        if line.count(';') == 2: # or any other appropriate criteria
            good.write(line)
        else:
            bad.write(line)
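
Counting ';' characters would misclassify a row whose quoted field contains a semicolon. If that can happen in your data, one alternative criterion is to count parsed fields with the standard csv module. This is only a sketch: split_rows and expected_fields are hypothetical names, and it assumes no field spans more than one line.

```python
import csv


def split_rows(lines, expected_fields=3, delimiter=";"):
    """Partition raw lines into (good, bad) by their parsed field count."""
    good, bad = [], []
    for line in lines:
        # Parse one line at a time so the raw text can be written back unchanged.
        fields = next(csv.reader([line], delimiter=delimiter), [])
        if len(fields) == expected_fields:
            good.append(line)
        else:
            bad.append(line)
    return good, bad


# Sample rows copied from the question.
sample = [
    "001;Snow,Jon;19801201\n",
    "002;Crom,Jake;19920103\n",
    "003; ;Wise,Frank;19880303\n",
    "004;Wiseau,Tommy;4324;1323;2323\n",
]
good, bad = split_rows(sample)
```

The two lists can then be written out with good_file.writelines(good) and bad_file.writelines(bad), in the same way as the loop above.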

CodePudding user response：

I suggest reading the file with open() and parsing it afterwards:

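import pandas as pd

with open('test.csv', 'r') as f:
    text = f.read().splitlines()  # drop trailing newlines so the last field stays clean

df = pd.DataFrame(text, columns=['text'])
df = df.text.str.split(';', expand=True)  # one column per field; short rows are padded with None
df.to_csv('fix_me.csv', index=False)  # either filter out the bad rows by the last two columns, or save as is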

     0             1           2           3       4
0  001      Snow,Jon  19801201          None    None
1  002     Crom,Jake  19920103          None    None
2  003                Wise,Frank    19880303    None
3  004  Wiseau,Tommy        4324        1323    2323
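
The filtering by the extra columns mentioned in the code comment could look roughly like this. It is a sketch that assumes a valid row has exactly three fields; expected_cols, is_bad, and the other variable names are illustrative, and the sample lines are inlined so the snippet runs without test.csv.

```python
import pandas as pd

# Sample rows copied from the question, without trailing newlines.
lines = [
    "001;Snow,Jon;19801201",
    "002;Crom,Jake;19920103",
    "003; ;Wise,Frank;19880303",
    "004;Wiseau,Tommy;4324;1323;2323",
]

df = pd.DataFrame(lines, columns=["text"])["text"].str.split(";", expand=True)

expected_cols = 3  # assumption: a valid row has exactly three fields

# A row is bad if it spills into the extra columns or leaves a required one empty.
too_many = df.iloc[:, expected_cols:].notna().any(axis=1)
too_few = df.iloc[:, :expected_cols].isna().any(axis=1)
is_bad = too_many | too_few

good_df = df.loc[~is_bad].iloc[:, :expected_cols]
bad_df = df.loc[is_bad]

# Recover the untouched original text of each bad row for the text file.
bad_lines = [line for line, flag in zip(lines, is_bad) if flag]
```

Writing "\n".join(bad_lines) to a text file then gives the invalid rows exactly as they appeared in the source, while good_df holds the clean three-column data.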