This is a small example of a big csv file. Imagine this csv file contais a text like this one below:
import io
import pandas as pd
text="""A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P
"2019,01/2019,Brazil,0,""166,229"",0,""22578,86292"",""47,417"",,,,,,,,"
"2019,01/2019,Brazil,95,0,""50,34563"",0,""5,137"",,,,,,,,"
2019,01/2019,Brazil,0,0,0,0,0,,,,,,,,"""
I would like to read it as a dataframe and the first line should be the column names. My problem occurs with the lines 2 and 3 because of the quotation marks separating float numbers. If I use the code below, the last line is perfect because there aren't any quotation marks separating the numbers.
df = pd.read_csv(io.StringIO(text), sep=',', engine="python", encoding='utf-8', decimal=",")
df
Out[73]:
A B ... O P
0 2019,01/2019,Brazil,0,"166,229",0,"22578,86292... None ... NaN NaN
1 2019,01/2019,Brazil,95,0,"50,34563",0,"5,137",... None ... NaN NaN
2 2019 01/2019 ... NaN NaN
[3 rows x 16 columns]
Could you show me how can I read this properly, separating each value by column? Thanks
CodePudding user response:
First I would check how the CSV file is generated and fix the root cause (why are there the quotation marks?) If this isn't possible, try to remove all quotation marks:
text = """A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P
"2019,01/2019,Brazil,0,""166,229"",0,""22578,86292"",""47,417"",,,,,,,,"
"2019,01/2019,Brazil,95,0,""50,34563"",0,""5,137"",,,,,,,,"
2019,01/2019,Brazil,0,0,0,0,0,,,,,,,,"""
import re
text = re.sub(r'""(\d ,?\d*)""', lambda g: g.group(1).replace(",", "."), text)
df = pd.read_csv(
io.StringIO(text.replace('"', "")),
sep=",",
engine="python",
encoding="utf-8",
)
print(df)
This prints:
A B C D E F G H I J K L M N O P
0 2019 01/2019 Brazil 0 166.229 0.00000 22578.86292 47.417 NaN NaN NaN NaN NaN NaN NaN NaN
1 2019 01/2019 Brazil 95 0.000 50.34563 0.00000 5.137 NaN NaN NaN NaN NaN NaN NaN NaN
2 2019 01/2019 Brazil 0 0.000 0.00000 0.00000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN