My csv files in S3 look like this. I want to remove all double quotes from each cells.
f_name , l_name, add, phone
"JOE", "DON", """, "1234"
"JOE", "DON", """, "1234"
"SID", "SIR", "", "456"
"SID", "ABC", "A", "456"
I have read this csv from from pandas
body = object.get()['Body'].read()
reader = pd.read_csv(io.BytesIO(body), delimiter=',', quotechar='"',encoding='utf8', quoting=csv.QUOTE_ALL, skipinitialspace=True)
but rows having 3 """ are causing problem and i am not able to clear/Parse this. Can you please help me out.
CodePudding user response:
As a very basic approach (relying on pandas) you could read the data without providing the quoting option which will ensure having the "
chars in your data. These chars can then be removed by applying the .replace()
method on each cell item via .applymap()
:
import io
import pandas as pd
data_string = '''
f_name , l_name, add, phone
"JOE", "DON", """, "1234"
"JOE", "DON", """, "1234"
"SID", "SIR", "", "456"
"SID", "ABC", "A", "456"
'''
print(data_string)
df = pd.read_csv(io.StringIO(data_string), delimiter=',')
print(df)
cleaned = df.applymap(lambda x: x.replace('"', ''))
print(cleaned)
Giving a final output of:
f_name l_name add phone
0 JOE DON 1234
1 JOE DON 1234
2 SID SIR 456
3 SID ABC A 456
CodePudding user response:
If you want to change triple-" into double-" before read in dataframe, you can use regex sub.
import re
body = object.get()['Body'].read()
body = re.sub(b'"""', b'""', body)
Note that search and replace chars are also bytes, as the type of body
is bytes.