How to clear 3 double quotes from some rows of a csv file in python pandas-CodePudding

My csv files in S3 look like this. I want to remove all double quotes from each cells.

f_name , l_name, add, phone

"JOE", "DON", """, "1234"

"JOE", "DON", """, "1234"

"SID", "SIR", "", "456"

"SID", "ABC", "A", "456"

I have read this csv from from pandas

body = object.get()['Body'].read()
reader = pd.read_csv(io.BytesIO(body), delimiter=',', quotechar='"',encoding='utf8', quoting=csv.QUOTE_ALL, skipinitialspace=True)

but rows having 3 """ are causing problem and i am not able to clear/Parse this. Can you please help me out.

Image

CodePudding user response：

As a very basic approach (relying on pandas) you could read the data without providing the quoting option which will ensure having the " chars in your data. These chars can then be removed by applying the .replace() method on each cell item via .applymap():

import io
import pandas as pd


data_string = '''
f_name , l_name, add, phone
"JOE", "DON", """, "1234"
"JOE", "DON", """, "1234"
"SID", "SIR", "", "456"
"SID", "ABC", "A", "456"
'''

print(data_string)

df = pd.read_csv(io.StringIO(data_string), delimiter=',')
print(df)

cleaned = df.applymap(lambda x: x.replace('"', ''))
print(cleaned)

Giving a final output of:

        f_name      l_name  add     phone
0       JOE         DON             1234
1       JOE         DON             1234
2       SID         SIR              456
3       SID         ABC     A        456

CodePudding user response：

If you want to change triple-" into double-" before read in dataframe, you can use regex sub.

import re

body = object.get()['Body'].read()
body = re.sub(b'"""', b'""', body)

Note that search and replace chars are also bytes, as the type of body is bytes.