I am trying to remove the double quotes (`""`) and semicolons (`;`) from my CSV file in PySpark. The data in the CSV looks like this:
age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
The code I am using is:
from pyspark.sql import functions as F

df = spark.read.options(delimiter=';').csv("C:/Project_bankdata.csv", header=True)
df1 = df.select([F.regexp_replace(c, '"', '').alias(c) for c in df.columns])
df1.show(10, truncate=0)
Output:
+-------------+-----------+-------------+-----------+-----------+-----------+--------+-----------+-------+---------+------------+------------+---------+------------+------------+------+
|"age;""job"" |""marital""|""education""|""default""|""balance""|""housing""|""loan""|""contact""|""day""|""month""|""duration""|""campaign""|""pdays""|""previous""|""poutcome""|""y"""|
+-------------+-----------+-------------+-----------+-----------+-----------+--------+-----------+-------+---------+------------+------------+---------+------------+------------+------+
|58;management|married    |tertiary     |no         |2143       |yes        |no      |unknown    |5      |may      |261         |1           |-1       |0           |unknown     |no    |
+-------------+-----------+-------------+-----------+-----------+-----------+--------+-----------+-------+---------+------------+------------+---------+------------+------------+------+
I am able to get rid of the quotes in the data, but not in the header. How can I remove the double quotes from the header as well?
CodePudding user response:
This is the dataframe output I get when running your code with the CSV data you posted. I am using Spark version 2.3.2:
+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
|age|job       |marital|education|default|balance|housing|loan|contact|day|month|duration|campaign|pdays|previous|poutcome|y  |
+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
|58 |management|married|tertiary |no     |2143   |yes    |no  |unknown|5  |may  |261     |1       |-1   |0       |unknown |no |
+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
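If the header still comes out with stray quotes after reading, one option is to clean the column names and reapply them with `df.toDF(*cleaned)`. The renaming logic itself is plain string manipulation, so here is a Spark-free sketch of it (`clean_columns` is a hypothetical helper name):

```python
def clean_columns(columns):
    """Remove every double-quote character from each column name."""
    return [c.replace('"', '') for c in columns]

# Column names as they appear in the broken header above.
header = ['""marital""', '""education""', '""y"""']
print(clean_columns(header))  # ['marital', 'education', 'y']
```

In Spark this would be applied as `df1 = df.toDF(*clean_columns(df.columns))`, which renames the columns without touching the data.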
CodePudding user response:
I was only able to reproduce your output if I used this input CSV:
"age;""job"";""marital"";""education"";""default"";""balance"";""housing"";""loan"";""contact"";""day"";""month"";""duration"";""campaign"";""pdays"";""previous"";""poutcome"";""y"""
"58;"management"";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
You can read the CSV as a plain text file, remove all the double quotes (`"`) from every line, and then build a dataframe:
rdd = spark.sparkContext.textFile(r"C:\temp\temp.csv")
rdd = rdd.map(lambda line: line.replace('"', '').split(';'))
header = rdd.first()
df = rdd.filter(lambda line: line != header).toDF(header)
df.show()
# +---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
# |age|       job|marital|education|default|balance|housing|loan|contact|day|month|duration|campaign|pdays|previous|poutcome|  y|
# +---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
# | 58|management|married| tertiary|     no|   2143|    yes|  no|unknown|  5|  may|     261|       1|   -1|       0| unknown| no|
# +---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
Note: this effectively strips all string quoting from the CSV, so it will only work correctly if none of your quoted values contain `;` inside them.
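The per-line cleanup inside `rdd.map` boils down to a plain string operation, so it can be illustrated without a Spark session. A sketch on shortened versions of the sample lines:

```python
def parse_line(line):
    """Drop all double quotes, then split on the ';' delimiter."""
    return line.replace('"', '').split(';')

header = parse_line('"age;""job"";""y"""')
row = parse_line('58;"management";"no"')
print(header)  # ['age', 'job', 'y']
print(row)     # ['58', 'management', 'no']
```

This also makes the caveat above concrete: a quoted value like `"a;b"` would be split into two fields, because the quotes are removed before splitting.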