Repair or drop rows with extra columns in pandas-CodePudding

I have csv file with 1000 rows.

Each row is supposed to have 5 columns. But few rows have 6 columns due to "\" in the dataset as explained below exaple.

Input data: it looks like below when I open csv file in a text editor.

ID, name, description, qty, price 
1, pen, good quality, 4, 5.50
2, book, Good read and\, must, 5, 10
3, PENCIL, good, 5, 100
4, book2, Good read and\, must, 10, 100

I want to remove "\" from the file and make all the rows as 5 column data as below.

Required output1:

ID, name, description, qty, price 
1, pen, good quality, 4, 5.50
2, book, Good read and must, 5, 10
3, PENCIL, good, 5, 100
4, book2, Good read and must, 10, 100

If the above is not possible, I would like to drop the problematic rows which has extra columns(6 rows) and finally make the all the rows as 5 column data.

Required output2:

ID, name, description, qty, price 
1, pen, good quality, 4, 5.50
3, PENCIL, good, 5, 100

Thanks.

CodePudding user response：

We can use the replace method like so :

data = """
ID, name, description, qty, price 
1, pen, good quality, 4, 5.50
2, book, Good read and\, must, 5, 10
3, PENCIL, good, 5, 100
4, book2, Good read and\, must, 10, 100""".replace("\,", "")

Then using StringIO :

>>> import pandas as pd
>>> from io import StringIO

>>> df = pd.read_csv(StringIO(data), sep=',')
>>> df
   ID   name    description         qty price
0   1   pen     good quality        4   5.5
1   2   book    Good read and must  5   10.0
2   3   PENCIL  good                5   100.0
3   4   book2   Good read and must  10  100.0

CodePudding user response：

To clean the file:

read the file
replace "\," with empty string ""
write back to file
read to DataFrame if needed

with open("data.csv") as file:
    contents = file.read().replace("\\,","")

with open("output.csv", "w ") as file:
    file.write(contents)

#if you want to read to a DataFrame
df = pd.read_csv("output.csv")

>>> df
   ID     name          description   qty   price 
0   1      pen         good quality     4      5.5
1   2     book   Good read and must     5     10.0
2   3   PENCIL                 good     5    100.0
3   4    book2   Good read and must    10    100.0

output.csv:

ID, name, description, qty, price 
1, pen, good quality, 4, 5.50
2, book, Good read and must, 5, 10
3, PENCIL, good, 5, 100
4, book2, Good read and must, 10, 100

To drop problematic rows:

df = pd.read_csv("data.csv", error_bad_lines=False)

>>> df
   ID     name    description   qty   price 
0   1      pen   good quality     4      5.5
1   3   PENCIL           good     5    100.0

CodePudding user response：

By the way what is making 6 rows is the ',' not \ because it is a Comma Separated File (CSV). To fix this issue you will need to remove all the commas that come after a backslash in your data.

with open('file.csv', 'r') as f:
    file = f.read()

file = file.replace("\,", "")

with open('new_file.csv', 'w') as wf:
    wf.write(file)