I've been trying to figure this problem out. I receive CSV data in which every field is enclosed in double quotes ("). The problem is that some cells also contain double quotes inside the string, so when I try to load the data with pandas' read_csv I get an error. I know I can skip those rows, but I would like to load them and work with the whole dataset.
The dataset is something like:
"SG126AS01772","2021-07-06","030046","STARTER"
"SCG12SDF4163","2021-09-27","146054","ISKCON - TEMPLE"
"SPH1108SD964","2020-07-10","075825","MICHAEL MARANO PANEL 48""
"SNA11801SD11","2021-04-20","033090","33 MAPLE AVE "PANELBOARDS" NEWARK,"
"SAG146ZBC026","2020-12-08","270216","GRANT AMPD"
Look at the last field of rows 3 and 4.
Is there any way to work around it, or is this something that definitely needs to be fixed manually?
CodePudding user response:
This collapses each doubled double quote into a single one, though I think there are some other problems with your data.
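# Read the whole file into memory.
with open('path_to_your_file.csv', 'r') as f:
    string = f.read()

# Collapse each doubled double quote ("") into a single one (").
string = string.replace('""', '"')

# Overwrite the original file with the cleaned text.
with open('path_to_your_file.csv', 'w') as f:
    f.write(string)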
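To see what that replacement does before overwriting anything, here is a minimal sketch that applies it to rows 3 and 4 of the sample, held in a string instead of a file. The names `sample` and `cleaned` are only illustrative.

```python
# Rows 3 and 4 from the question, kept in memory so no file is touched.
sample = (
    '"SPH1108SD964","2020-07-10","075825","MICHAEL MARANO PANEL 48""\n'
    '"SNA11801SD11","2021-04-20","033090","33 MAPLE AVE "PANELBOARDS" NEWARK,"\n'
)

# Same replacement as above: every "" becomes ".
cleaned = sample.replace('""', '"')

for before, after in zip(sample.splitlines(), cleaned.splitlines()):
    status = "changed" if before != after else "unchanged"
    print(status, after)
```

Row 3 now ends in a single quote, which a CSV reader will take as the closing quote, so the inch mark is dropped. Row 4 has no doubled quotes and comes out unchanged, so it would still need separate handling.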
CodePudding user response:
It looks like only the last field of each row is ever going to be a problem, so we can parse the file ourselves:
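import pandas as pd

data = []
with open('file.csv') as f:
    for line in f:
        # Split on the first 3 commas only, so commas in the last field are kept.
        fields = line.strip().split(',', 3)
        # Drop the opening and closing quote of each field.
        data.append([x[1:-1] for x in fields])

df = pd.DataFrame(data)
print(df)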
# Output:
              0           1       2                                   3
0  SG126AS01772  2021-07-06  030046                             STARTER
1  SCG12SDF4163  2021-09-27  146054                     ISKCON - TEMPLE
2  SPH1108SD964  2020-07-10  075825            MICHAEL MARANO PANEL 48"
3  SNA11801SD11  2021-04-20  033090  33 MAPLE AVE "PANELBOARDS" NEWARK,
4  SAG146ZBC026  2020-12-08  270216                          GRANT AMPD
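A variation on the same idea, as a sketch only: if commas could also appear in fields other than the last one, you could split on the three-character sequence `","` instead of on the first three commas. This assumes that sequence never occurs inside a field, and `split_quoted_line` is a hypothetical helper name. The second half re-serialises the rows with the standard csv module, which doubles embedded quotes by default, so the result is regular CSV.

```python
import csv
import io


def split_quoted_line(line):
    """Split a line whose fields are all wrapped in double quotes.

    Assumes the sequence '","' only ever appears between fields.
    """
    body = line.rstrip("\r\n")
    # Drop the outer quotes, then split on quote-comma-quote.
    return body[1:-1].split('","')


# Rows 3 and 4 from the question.
raw = (
    '"SPH1108SD964","2020-07-10","075825","MICHAEL MARANO PANEL 48""\n'
    '"SNA11801SD11","2021-04-20","033090","33 MAPLE AVE "PANELBOARDS" NEWARK,"\n'
)
rows = [split_quoted_line(line) for line in raw.splitlines() if line]

# Write the rows back out as standard CSV with every field quoted.
fixed = io.StringIO()
csv.writer(fixed, quoting=csv.QUOTE_ALL, lineterminator="\n").writerows(rows)
print(fixed.getvalue())
```

The text in `fixed` could then be saved to a new file and should load with a plain read_csv call, since doubled quotes are the standard way to escape a quote inside a quoted field.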
CodePudding user response:
Using Pandas read_csv
Code
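from io import StringIO
import pandas as pd

with open('test.csv') as f:  # test.csv holds the data from the question
    s = f.read()

df = (pd.read_csv(StringIO(s),
                  sep=r'(?!\B"[^"]*),(?![^"]*"\B)',  # separator is a comma not inside double quotes
                  header=None,
                  engine="python")
        .replace(r'^"(.*)"$', r"\1", regex=True))  # drop the quote pair at the begin/end of values
display(df)  # show result in a notebook; use print(df) elsewhere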
Explanation
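- sep = r'(?!\B"[^"]*),(?![^"]*"\B)' tells read_csv to split only on commas that are not enclosed in double quotes
- engine = "python" is needed because regular-expression separators are only supported by the Python parsing engine
- replace(r'^"(.*)"$', r"\1", regex = True) drops the pair of double quotes surrounding each value in the dataframe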
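To check how the separator pattern behaves on a single row, here is a small sketch that uses the standard re module directly instead of read_csv. It only approximates what the parser does with a regex separator, and `split_fields` is a hypothetical helper name.

```python
import re

# Same pattern as the sep argument above.
SEP = re.compile(r'(?!\B"[^"]*),(?![^"]*"\B)')


def split_fields(line):
    """Split one line on the separator, then strip the outer quote pair."""
    parts = SEP.split(line)
    return [re.sub(r'^"(.*)"$', r'\1', part) for part in parts]


line = '"SNA11801SD11","2021-04-20","033090","33 MAPLE AVE "PANELBOARDS" NEWARK,"'
values = split_fields(line)
for value in values:
    print(value)

# A field that starts with a non-word character is not split off.
print(SEP.split('"a","-x"'))
```

The trailing comma in `NEWARK,"` is kept because the quote after it is followed by the end of the line, where `\B` matches. One thing worth checking on your own data: because the pattern relies on `\B`, a comma in front of a field that begins with a non-word character (such as `"-x"`) is not treated as a separator.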
Input
File: test.csv contains the data from the posted question
Output
              0           1       2                                   3
0  SG126AS01772  2021-07-06  030046                             STARTER
1  SCG12SDF4163  2021-09-27  146054                     ISKCON - TEMPLE
2  SPH1108SD964  2020-07-10  075825            MICHAEL MARANO PANEL 48"
3  SNA11801SD11  2021-04-20  033090  33 MAPLE AVE "PANELBOARDS" NEWARK,
4  SAG146ZBC026  2020-12-08  270216                          GRANT AMPD