I have csv file with 1000 rows.
Each row is supposed to have 5 columns. But few rows have 6 columns due to "\" in the dataset as explained below exaple.
Input data: it looks like below when I open csv file in a text editor.
ID, name, description, qty, price
1, pen, good quality, 4, 5.50
2, book, Good read and\, must, 5, 10
3, PENCIL, good, 5, 100
4, book2, Good read and\, must, 10, 100
I want to remove "\" from the file and make all the rows as 5 column data as below.
Required output1:
ID, name, description, qty, price
1, pen, good quality, 4, 5.50
2, book, Good read and must, 5, 10
3, PENCIL, good, 5, 100
4, book2, Good read and must, 10, 100
If the above is not possible, I would like to drop the problematic rows which has extra columns(6 rows) and finally make the all the rows as 5 column data.
Required output2:
ID, name, description, qty, price
1, pen, good quality, 4, 5.50
3, PENCIL, good, 5, 100
Thanks.
CodePudding user response:
We can use the replace
method like so :
data = """
ID, name, description, qty, price
1, pen, good quality, 4, 5.50
2, book, Good read and\, must, 5, 10
3, PENCIL, good, 5, 100
4, book2, Good read and\, must, 10, 100""".replace("\,", "")
Then using StringIO
:
>>> import pandas as pd
>>> from io import StringIO
>>> df = pd.read_csv(StringIO(data), sep=',')
>>> df
ID name description qty price
0 1 pen good quality 4 5.5
1 2 book Good read and must 5 10.0
2 3 PENCIL good 5 100.0
3 4 book2 Good read and must 10 100.0
CodePudding user response:
To clean the file:
- read the file
replace
"\," with empty string ""- write back to file
- read to DataFrame if needed
with open("data.csv") as file:
contents = file.read().replace("\\,","")
with open("output.csv", "w ") as file:
file.write(contents)
#if you want to read to a DataFrame
df = pd.read_csv("output.csv")
>>> df
ID name description qty price
0 1 pen good quality 4 5.5
1 2 book Good read and must 5 10.0
2 3 PENCIL good 5 100.0
3 4 book2 Good read and must 10 100.0
output.csv:
ID, name, description, qty, price
1, pen, good quality, 4, 5.50
2, book, Good read and must, 5, 10
3, PENCIL, good, 5, 100
4, book2, Good read and must, 10, 100
To drop problematic rows:
df = pd.read_csv("data.csv", error_bad_lines=False)
>>> df
ID name description qty price
0 1 pen good quality 4 5.5
1 3 PENCIL good 5 100.0
CodePudding user response:
By the way what is making 6 rows is the ',' not \ because it is a Comma Separated File (CSV). To fix this issue you will need to remove all the commas that come after a backslash in your data.
with open('file.csv', 'r') as f:
file = f.read()
file = file.replace("\,", "")
with open('new_file.csv', 'w') as wf:
wf.write(file)