I need to load in pandas a CSV file that is not 100% CSV "compliant", below an example:
"Transaction date";"Accounting date";"Counterparty's data";"Title"
2021-08-22;2021-08-22;" "SPOLEM" ASS "ALDONA" AUGUSTOW ";" Title 450"
2019-09-02;2019-09-02;" 13XYZ05 "SKOWRONEK" NIEGOWA ";" Title 1300"
2010-07-18;2010-07-18;" APTEKA "SLOWINSKA" SPOLKALEBA ";" Title 123"
I read this csv file (csv_in) into a pandas data frame with the following command:
df = pd.read_csv(csv_in, \
delimiter=';', \
engine='python', \
quoting=1)
I understand that the bad csv formatting is the culprit, but:
- how can I instruct pandas to indicate at which row the process breakes instead of simply informing me that pandas.errors.ParserError: ';' expected after '"' ... I want to know at which row of the csv_in file it broke ... having a 6500 rows file you can imagine how hard was for me to find these malformed lines without any aid except "hey there is an error!"
- is it possible to instruct pandas to use the combo [;"] as starting text field and [";] as ending text field? This should solve the issue and apparently is somehow understood by CSV module (import csv) that reads the file without throwing errors, without skipping lines
Thanks! Evan
CodePudding user response:
specify the quotechar and remove the engine attributes (resulting in C being default)
pd.read_csv("csv2.txt", \
delimiter=';', \
quotechar='"')
Transaction date Accounting date Counterparty's data Title
0 2021-08-22 2021-08-22 SPOLEM" ASS "ALDONA" AUGUSTOW " Title 450
1 2019-09-02 2019-09-02 13XYZ05 SKOWRONEK" NIEGOWA " Title 1300
2 2010-07-18 2010-07-18 APTEKA SLOWINSKA" SPOLKALEBA " Title 123
CodePudding user response:
You can try to clean your data by escaping the inner quotes "
by \
:
import re
import pandas as pd
from io import StringIO
with open("your_data.csv", "r") as f_in:
data = f_in.read()
data = re.sub(
r'"([^;] )"',
lambda g: '"' g.group(1).replace('"', r"\"").strip() '"',
data,
)
df = pd.read_csv(StringIO(data), delimiter=";", quotechar='"', escapechar="\\")
print(df)
This prints:
Transaction date Accounting date Counterparty's data Title
0 2021-08-22 2021-08-22 "SPOLEM" ASS "ALDONA" AUGUSTOW Title 450
1 2019-09-02 2019-09-02 13XYZ05 "SKOWRONEK" NIEGOWA Title 1300
2 2010-07-18 2010-07-18 APTEKA "SLOWINSKA" SPOLKALEBA Title 123