How to avoid a quote error when loading a CSV file in pandas

I've been trying to figure out this problem for a while. I receive CSV data in which every field is enclosed in double quotes ("). The problem is that some cells also contain double quotes inside the string, so when I try to load the data with pandas' read_csv I get an error. I know that I can skip those rows, but I would like to load them and work with the whole dataset.

The dataset is something like:

"SG126AS01772","2021-07-06","030046","STARTER"
"SCG12SDF4163","2021-09-27","146054","ISKCON - TEMPLE"
"SPH1108SD964","2020-07-10","075825","MICHAEL MARANO PANEL 48""
"SNA11801SD11","2021-04-20","033090","33 MAPLE AVE "PANELBOARDS"  NEWARK,"
"SAG146ZBC026","2020-12-08","270216","GRANT AMPD"

Look at the last field of rows 3 and 4.

Is there any way to work around this, or is it something that definitely has to be fixed manually?

CodePudding user response：

This would help by collapsing each doubled double quote ("") into a single one. But I think there are some other problems with your data.

# Collapse every doubled quote ("") into a single quote; note that this rewrites the file in place.
with open('path_to_your_file.csv', 'r') as f:
    text = f.read()
text = text.replace('""', '"')
with open('path_to_your_file.csv', 'w') as f:
    f.write(text)
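
To see what the replacement does and does not fix, one rough check is to count the double quotes on each line: with four quoted fields, a clean line has exactly eight. The sketch below is only an illustration on the sample rows from the question; `SAMPLE`, `suspicious_lines` and the `n_fields` parameter are hypothetical names, not part of the answer above.

```python
# Sample rows copied from the question.
SAMPLE = '''"SG126AS01772","2021-07-06","030046","STARTER"
"SCG12SDF4163","2021-09-27","146054","ISKCON - TEMPLE"
"SPH1108SD964","2020-07-10","075825","MICHAEL MARANO PANEL 48""
"SNA11801SD11","2021-04-20","033090","33 MAPLE AVE "PANELBOARDS"  NEWARK,"
"SAG146ZBC026","2020-12-08","270216","GRANT AMPD"
'''


def suspicious_lines(text, n_fields=4):
    """Return 1-based numbers of lines whose quote count is not 2 * n_fields."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line.count('"') != 2 * n_fields
    ]


before = suspicious_lines(SAMPLE)                    # raw data
after = suspicious_lines(SAMPLE.replace('""', '"'))  # after the replacement
print(before, after)
```

On this sample, lines 3 and 4 are flagged before the replacement and line 4 is still flagged afterwards. Line 3 passes the count afterwards only because the inch mark after 48 has been removed along with the doubled quote.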

CodePudding user response：

It looks like only your last column is ever going to be a problem, so we can parse the file ourselves:

import pandas as pd

data = []
with open('file.csv') as f:
    for line in f:
        line = line.strip().split(',', 3)  # Split on the first 3 commas only.
        line = [x[1:-1] for x in line]     # Drop the enclosing quote pair of each field.
        data.append(line)

df = pd.DataFrame(data)
print(df)

# Output:
              0           1       2                                    3
0  SG126AS01772  2021-07-06  030046                              STARTER
1  SCG12SDF4163  2021-09-27  146054                      ISKCON - TEMPLE
2  SPH1108SD964  2020-07-10  075825             MICHAEL MARANO PANEL 48"
3  SNA11801SD11  2021-04-20  033090  33 MAPLE AVE "PANELBOARDS"  NEWARK,
4  SAG146ZBC026  2020-12-08  270216                           GRANT AMPD
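
A variant of the same idea, in case commas can also show up in fields other than the last one: drop the outer quotes of the line and split on the three-character sequence `","` instead of on bare commas. This is just a sketch, and it assumes that `","` never occurs inside a value; `split_quoted_line` is a made-up helper name.

```python
def split_quoted_line(line):
    """Split one fully quoted CSV line on the '","' sequence between fields."""
    line = line.strip()
    # Drop the opening quote of the first field and the closing quote of the last.
    return line[1:-1].split('","')


# Rows 3 and 4 from the question, the two that contain inner quotes.
rows = [
    '"SPH1108SD964","2020-07-10","075825","MICHAEL MARANO PANEL 48""',
    '"SNA11801SD11","2021-04-20","033090","33 MAPLE AVE "PANELBOARDS"  NEWARK,"',
]
parsed = [split_quoted_line(row) for row in rows]
for fields in parsed:
    print(fields)
```

The resulting list of lists can be passed to pd.DataFrame in the same way as `data` above.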

CodePudding user response：

Using Pandas read_csv

Code

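from io import StringIO

import pandas as pd

with open('test.csv') as f:  # assumed: s is the text of test.csv (see Input)
    s = f.read()

df = (pd.read_csv(StringIO(s),
                  sep = r'(?!\B\"[^\"]*),(?![^\"]*\"\B)', # separator is comma not inside double quotes
                  header = None,
                  engine = "python")
      .replace(r'^\"(.*)\"$', r"\1", regex = True))  # Drop quote pair at begin/end of values

print(df)                                            # show result (display(df) in a notebook)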

Explanation

sep = r'(?!\B\"[^\"]*),(?![^\"]*\"\B)' tells read_csv to use commas that are not enclosed in double quotes as the separator
replace(r'^\"(.*)\"$', r"\1", regex = True) drops the double quotes surrounding each value in the dataframe
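
The separator pattern can be tried on its own with the standard re module, which makes it easier to see where it splits. The following is only a sketch on single lines, not a reproduction of what read_csv does internally; `split_fields` is a hypothetical helper. As far as I can tell, the pattern relies on each field starting with a letter or digit right after its opening quote, so a value such as "-B" would not be split off.

```python
import re

# Separator pattern and quote-stripping pattern taken from the answer above.
SEP = re.compile(r'(?!\B\"[^\"]*),(?![^\"]*\"\B)')
OUTER_QUOTES = re.compile(r'^\"(.*)\"$')


def split_fields(line):
    """Split a line on SEP, then drop the quote pair around each field."""
    return [OUTER_QUOTES.sub(r'\1', field) for field in SEP.split(line)]


# Row 4 from the question: inner quotes plus a trailing comma inside the last field.
row4 = '"SNA11801SD11","2021-04-20","033090","33 MAPLE AVE "PANELBOARDS"  NEWARK,"'
fields = split_fields(row4)
print(fields)

# Caveat: a field that starts with a non-word character is not split off.
tricky = split_fields('"A","-B"')
print(tricky)
```

So before relying on this separator for a larger file, it is probably worth checking that every row comes back with the expected number of fields.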

Input

File: test.csv contains data in posted question

Output

              0           1       2                                    3
0  SG126AS01772  2021-07-06  030046                              STARTER
1  SCG12SDF4163  2021-09-27  146054                      ISKCON - TEMPLE
2  SPH1108SD964  2020-07-10  075825             MICHAEL MARANO PANEL 48"
3  SNA11801SD11  2021-04-20  033090  33 MAPLE AVE "PANELBOARDS"  NEWARK,
4  SAG146ZBC026  2020-12-08  270216                           GRANT AMPD