Home > Back-end >  pandas field separator and double quotes
pandas field separator and double quotes

Time:11-24

I need to load in pandas a CSV file that is not 100% CSV "compliant", below an example:

"Transaction date";"Accounting date";"Counterparty's data";"Title"
2021-08-22;2021-08-22;" "SPOLEM" ASS "ALDONA" AUGUSTOW ";" Title 450"
2019-09-02;2019-09-02;" 13XYZ05 "SKOWRONEK" NIEGOWA ";" Title 1300"
2010-07-18;2010-07-18;" APTEKA "SLOWINSKA" SPOLKALEBA ";" Title 123"

I read this csv file (csv_in) into a pandas data frame with the following command:

df = pd.read_csv(csv_in, \
                 delimiter=';', \
                 engine='python', \
                 quoting=1)

I understand that the bad csv formatting is the culprit, but:

  • how can I instruct pandas to indicate at which row the process breakes instead of simply informing me that pandas.errors.ParserError: ';' expected after '"' ... I want to know at which row of the csv_in file it broke ... having a 6500 rows file you can imagine how hard was for me to find these malformed lines without any aid except "hey there is an error!"
  • is it possible to instruct pandas to use the combo [;"] as starting text field and [";] as ending text field? This should solve the issue and apparently is somehow understood by CSV module (import csv) that reads the file without throwing errors, without skipping lines

Thanks! Evan

CodePudding user response:

specify the quotechar and remove the engine attributes (resulting in C being default)


pd.read_csv("csv2.txt", \
                 delimiter=';', \
                 quotechar='"')
    Transaction date    Accounting date     Counterparty's data     Title
0   2021-08-22  2021-08-22  SPOLEM" ASS "ALDONA" AUGUSTOW "     Title 450
1   2019-09-02  2019-09-02  13XYZ05 SKOWRONEK" NIEGOWA "    Title 1300
2   2010-07-18  2010-07-18  APTEKA SLOWINSKA" SPOLKALEBA "  Title 123

CodePudding user response:

You can try to clean your data by escaping the inner quotes " by \:

import re
import pandas as pd
from io import StringIO


with open("your_data.csv", "r") as f_in:
    data = f_in.read()

data = re.sub(
    r'"([^;] )"',
    lambda g: '"'   g.group(1).replace('"', r"\"").strip()   '"',
    data,
)

df = pd.read_csv(StringIO(data), delimiter=";", quotechar='"', escapechar="\\")
print(df)

This prints:

  Transaction date Accounting date             Counterparty's data       Title
0       2021-08-22      2021-08-22  "SPOLEM" ASS "ALDONA" AUGUSTOW   Title 450
1       2019-09-02      2019-09-02     13XYZ05 "SKOWRONEK" NIEGOWA  Title 1300
2       2010-07-18      2010-07-18   APTEKA "SLOWINSKA" SPOLKALEBA   Title 123
  • Related