How to avoid quote error when loading a csv file in Pandas

Time:08-10

I've been trying to figure this problem out. I receive CSV data in which every field is enclosed in double quotes ("). The problem is that some cells also contain double quotes inside the string, so when I try to load the data with pandas' read_csv I get an error. I know that I can skip those rows, but I would like to load them and work with the whole dataset.

The dataset is something like:

"SG126AS01772","2021-07-06","030046","STARTER"
"SCG12SDF4163","2021-09-27","146054","ISKCON - TEMPLE"
"SPH1108SD964","2020-07-10","075825","MICHAEL MARANO PANEL 48""
"SNA11801SD11","2021-04-20","033090","33 MAPLE AVE "PANELBOARDS"  NEWARK,"
"SAG146ZBC026","2020-12-08","270216","GRANT AMPD"

Look at the last field of rows 3 and 4.

Is there any way to work around it, or is it something that definitely needs to be fixed manually?

CodePudding user response:

This replaces each pair of consecutive double quotes with a single one and writes the result back to the file. I think there are some other problems with your data, though.

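# Read the whole file, collapse each doubled quote ("") into a single quote,
# then write the result back to the same path.
with open('path_to_your_file.csv', 'r') as f:
    text = f.read()
text = text.replace('""', '"')
with open('path_to_your_file.csv', 'w') as f:
    f.write(text)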
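
If the goal is to keep the interior quotes rather than delete them, one option is to rewrite the text so that every quote inside a field is doubled, which is how CSV escapes a quote within a quoted field. The sketch below assumes every field is wrapped in double quotes and that the three-character sequence `","` never occurs inside a field; `escape_inner_quotes` is a hypothetical helper name, not a pandas function.

```python
import csv
import io


def escape_inner_quotes(line):
    """Double every quote inside a field so the line becomes standard CSV.

    Assumes all fields are quoted and '","' only appears between fields.
    """
    # Drop the outer quote pair, then split on the field boundaries.
    fields = line.rstrip("\r\n")[1:-1].split('","')
    return ",".join('"' + f.replace('"', '""') + '"' for f in fields)


raw = '''"SPH1108SD964","2020-07-10","075825","MICHAEL MARANO PANEL 48""
"SNA11801SD11","2021-04-20","033090","33 MAPLE AVE "PANELBOARDS"  NEWARK,"
'''

fixed = "\n".join(escape_inner_quotes(line) for line in raw.splitlines())

# The standard csv module can now parse the repaired text.
rows = list(csv.reader(io.StringIO(fixed)))
for row in rows:
    print(row)
```

The repaired text could then be handed to pandas, for example with pd.read_csv(io.StringIO(fixed), header=None, dtype=str), where dtype=str keeps the leading zeros in the third column.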

CodePudding user response:

It looks like only the last field is ever going to be a problem, so we can parse the file ourselves:

import pandas as pd

data = []
with open('file.csv') as f:
    for line in f:
        line = line.strip().split(',', 3) # Only split first 3 commas.
        line = [x[1:-1] for x in line]
        data.append(line)

df = pd.DataFrame(data)
print(df)

# Output:
              0           1       2                                    3
0  SG126AS01772  2021-07-06  030046                              STARTER
1  SCG12SDF4163  2021-09-27  146054                      ISKCON - TEMPLE
2  SPH1108SD964  2020-07-10  075825             MICHAEL MARANO PANEL 48"
3  SNA11801SD11  2021-04-20  033090  33 MAPLE AVE "PANELBOARDS"  NEWARK,
4  SAG146ZBC026  2020-12-08  270216                           GRANT AMPD
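
The loop above trusts every line to have exactly four quoted fields; `x[1:-1]` would silently chop real characters from a field that is not quoted. A slightly more defensive sketch, under the same assumption that only the last field can contain commas or quotes, might look like this. `parse_line` and `EXPECTED_FIELDS` are illustrative names.

```python
EXPECTED_FIELDS = 4  # assumption: matches the sample data in the question


def parse_line(line, lineno):
    """Split one fully quoted line into fields, raising on unexpected shapes."""
    # Only the last field may contain commas, so split on the first three.
    parts = line.strip().split(",", EXPECTED_FIELDS - 1)
    if len(parts) != EXPECTED_FIELDS:
        raise ValueError(
            f"line {lineno}: expected {EXPECTED_FIELDS} fields, got {len(parts)}"
        )
    for part in parts:
        if len(part) < 2 or not (part.startswith('"') and part.endswith('"')):
            raise ValueError(f"line {lineno}: field is not quoted: {part!r}")
    return [part[1:-1] for part in parts]  # strip the outer quote pair


sample = [
    '"SG126AS01772","2021-07-06","030046","STARTER"',
    '"SNA11801SD11","2021-04-20","033090","33 MAPLE AVE "PANELBOARDS"  NEWARK,"',
]
rows = [parse_line(line, n) for n, line in enumerate(sample, start=1)]
print(rows)
```

The resulting list of rows can be passed to pd.DataFrame exactly as in the answer above.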

CodePudding user response:

Using Pandas read_csv

Code

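from io import StringIO

import pandas as pd

with open('test.csv') as f:                              # file holding the data from the question
    s = f.read()

df = (pd.read_csv(StringIO(s),
                  sep = r'(?!\B\"[^\"]*),(?![^\"]*\"\B)', # separator is comma not inside double quotes
                  header = None,
                  engine = "python")
      .replace(r'^\"(.*)\"$', r"\1", regex = True))      # Drop quote pair at begin/end of values

print(df)                                                # show result (display(df) in a notebook)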

Explanation

  • sep = r'(?!\B"[^"]*),(?![^"]*"\B)' tells read_csv to use only commas that are not enclosed in double quotes as separators
  • replace(r'^"(.*)"$', r"\1", regex = True) drops the double quotes surrounding each value in the DataFrame
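
To see how the separator pattern behaves outside of pandas, the sketch below applies it with re.split from the standard library to the two problematic rows of the question. The pattern is the one from the Code section, written without the redundant backslashes. It is only an illustration: the pattern appears to rely on each field starting with a letter or digit right after its opening quote, which holds for the sample data but is an assumption about any other file.

```python
import re

# Same separator pattern as in the read_csv call (redundant backslashes dropped).
SEP = r'(?!\B"[^"]*),(?![^"]*"\B)'

lines = [
    '"SPH1108SD964","2020-07-10","075825","MICHAEL MARANO PANEL 48""',
    '"SNA11801SD11","2021-04-20","033090","33 MAPLE AVE "PANELBOARDS"  NEWARK,"',
]

parsed = [
    # Split on separator commas, then drop the outer quote pair of each field.
    [re.sub(r'^"(.*)"$', r'\1', field) for field in re.split(SEP, line)]
    for line in lines
]

for row in parsed:
    print(row)
```

Both rows come out as four fields, with the interior quotes and the trailing comma of the last field preserved.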

Input

File: test.csv contains the data from the posted question

Output

               0             1       2        3
0   SG126AS01772    2021-07-06  030046  STARTER
1   SCG12SDF4163    2021-09-27  146054  ISKCON - TEMPLE
2   SPH1108SD964    2020-07-10  075825  MICHAEL MARANO PANEL 48"
3   SNA11801SD11    2021-04-20  033090  33 MAPLE AVE "PANELBOARDS" NEWARK,
4   SAG146ZBC026    2020-12-08  270216  GRANT AMPD